TL;DR

Provide Cardinal with a schema describing the fields you want to extract. We’ll return that schema populated with the matching values found in your document.

Endpoint

POST https://api.trycardinal.ai/extract
Content-Type: multipart/form-data
Auth: X-API-KEY: <API_KEY>
You may provide either file or fileUrl.
Mode: Set fast to true (fast path) or false (standard path). The default is false.
Use customContext (optional, string) to steer extraction with short, domain-specific hints.
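
The parameters above can be collected into form fields with a small helper. This is a minimal sketch, not part of any Cardinal SDK: build_extract_fields is a hypothetical name, and sending the boolean as the string "true" or "false" simply mirrors the request examples on this page.

```python
import json


def build_extract_fields(schema, file_url=None, fast=False, custom_context=None):
    """Build the form fields for POST /extract (hypothetical helper).

    `schema` may be a dict (serialized as JSON) or an already
    stringified schema such as a Zod definition.
    """
    fields = {
        # The schema field is always sent as a string.
        "schema": schema if isinstance(schema, str) else json.dumps(schema),
        # Form values are strings, so the flag is sent as "true" or "false".
        "fast": "true" if fast else "false",
    }
    if file_url is not None:
        fields["fileUrl"] = file_url
    if custom_context:
        fields["customContext"] = custom_context
    return fields


fields = build_extract_fields(
    {"type": "object", "properties": {"is_blurry": {"type": "boolean"}}},
    file_url="https://example.com/doc.pdf",
    fast=True,
)
```

Pass the result as data=fields to requests.post together with the X-API-KEY header. For a local upload, leave file_url unset and send the open file under the file field instead (for example through the files argument of requests.post).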

Methods of Schema Extraction

We offer two modes on the same endpoint:

1) Fast Schema — /extract with fast: true

Use Fast when you need a quick, low-latency, lower-cost pass.
  • Prioritizes speed over deep parsing
  • Input: PDF or image (.pdf, .jpg, .jpeg, .png)
  • Provide your schema as a string

Example Schema (JSON Schema)

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "BlurryStatus",
  "type": "object",
  "properties": {
    "is_blurry": {
      "type": "boolean",
      "description": "Indicates whether something is blurry"
    }
  },
  "required": ["is_blurry"]
}

Example Schema (Zod-style)

{
  "schema": "z.object({ is_blurry: z.boolean().describe('Indicates whether something is blurry') })"
}

Request (requests)

import requests
import json

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "BlurryStatus",
    "type": "object",
    "properties": {"is_blurry": {"type": "boolean"}},
    "required": ["is_blurry"]
}

data = {
    "fileUrl": "https://example.com/doc.pdf",
    "schema": json.dumps(schema),
    "fast": "true",
    "customContext": "Focus on image quality and clarity when determining blur status"  # optional
}

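# Authenticate with the X-API-KEY header (replace YOUR_KEY with your API key).
headers = {"X-API-KEY": "YOUR_KEY"}

response = requests.post("https://api.trycardinal.ai/extract", headers=headers, data=data)
response.raise_for_status()  # raise on HTTP errors before parsing the body
result = response.json()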

Example Response

{
  "response": "{\"is_blurry\": false}",
  "method": "fast"
}

2) Standard Schema — /extract with fast: false

  • Runs the full parsing pipeline first, then aligns to your schema
  • Best for complex layouts (dense tables, annotations, multi-page forms)
  • Slower, but more reliable for production workloads
  • Input: PDF or image via upload or fileUrl

Example Schema (JSON Schema)

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Invoice",
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "due_date": { "type": "string", "format": "date" },
    "total_amount": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount": { "type": "number" }
        },
        "required": ["description", "amount"]
      }
    }
  },
  "required": ["invoice_number", "due_date", "total_amount", "line_items"]
}

Request (requests)

import requests
import json

schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "due_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                },
                "required": ["description", "amount"]
            }
        }
    },
    "required": ["invoice_number", "due_date", "total_amount", "line_items"]
}

data = {
    "fileUrl": "https://example.com/invoice.pdf",
    "schema": json.dumps(schema),
    "fast": "false",  # standard mode
    "customContext": "This is a medical billing invoice. Pay attention to procedure codes and insurance information."  # optional
}

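# Authenticate with the X-API-KEY header (replace YOUR_KEY with your API key).
headers = {"X-API-KEY": "YOUR_KEY"}

response = requests.post("https://api.trycardinal.ai/extract", headers=headers, data=data)
response.raise_for_status()  # raise on HTTP errors before parsing the body
result = response.json()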

Example Response

{
  "success": true,
  "response": "{\"invoice_number\":\"INV-00392\",\"due_date\":\"2025-09-01\",\"total_amount\":1042.50,\"line_items\":[{\"description\":\"Consulting\",\"amount\":900.00},{\"description\":\"Tax\",\"amount\":142.50}]}",
  "method": "slow",
  "pages_processed": 6,
  "confidence_score": 0.95
}
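
Because response arrives as a JSON string in the example above, a quick client-side check of the schema's required lists can catch missing fields before the data is used. The sketch below is an illustration under stated assumptions, not a full JSON Schema validator: missing_required is a hypothetical helper that only follows properties, items, and required, and it does not check types or formats.

```python
import json


def missing_required(schema, value, path=""):
    """Return dotted paths of required fields that are absent from `value`."""
    missing = []
    if schema.get("type") == "object" and isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                missing.append(f"{path}{key}")
        # Recurse into the properties that are present.
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                missing += missing_required(sub, value[key], f"{path}{key}.")
    elif schema.get("type") == "array" and isinstance(value, list):
        item_schema = schema.get("items", {})
        for i, item in enumerate(value):
            # Drop the trailing dot so paths read like line_items[0].amount
            missing += missing_required(item_schema, item, f"{path[:-1]}[{i}].")
    return missing


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["invoice_number", "line_items"],
}

# A response string shaped like the one above, with one amount left out.
response = '{"invoice_number": "INV-00392", "line_items": [{"description": "Tax"}]}'
problems = missing_required(schema, json.loads(response))
print(problems)
```

An empty list means every required field was present; a non-empty list is a reasonable trigger for a retry or a manual review.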

Supported Schema Formats

Pass your schema as a string in the schema field:
  • JSON Schema (stringified object)
  • Zod (string)
  • TypeScript (interface/type as string)
  • Pydantic (model definition as string)
  • OpenAPI (schema as string)
  • Custom (any structured format as string)
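
As a rough illustration of the list above, the same one-field schema can be written as a string in several of these formats. The JSON Schema and Zod forms follow the examples earlier on this page; the TypeScript and Pydantic strings are assumptions about what a reasonable definition looks like, since the exact syntax expected for those formats is not specified here.

```python
import json

# The same one-field schema in four formats; every value is a plain string
# suitable for the `schema` form field.
SCHEMAS = {
    "json_schema": json.dumps({
        "type": "object",
        "properties": {"is_blurry": {"type": "boolean"}},
        "required": ["is_blurry"],
    }),
    "zod": "z.object({ is_blurry: z.boolean() })",
    # Assumed forms for the remaining formats:
    "typescript": "interface BlurryStatus { is_blurry: boolean; }",
    "pydantic": "class BlurryStatus(BaseModel):\n    is_blurry: bool",
}

for name, schema_string in SCHEMAS.items():
    print(f"{name}: {schema_string!r}")
```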

Custom Context

You can provide additional context to guide the extraction process using the optional customContext parameter:
  • Parameter: customContext (string, optional)
  • Purpose: Provides additional context or instructions to help the AI better understand your document or extraction requirements
  • Examples:
    • "This document contains medical terminology and abbreviations"
    • "Focus on financial data and ignore header/footer information"
    • "The document may contain handwritten notes in the margins"
    • "This is a form from 1995, so date formats may be non-standard"

API Usage Snippets

Python (generic)

import requests
import json

data = {
    "fileUrl": "https://example.com/doc.pdf",
    "schema": "z.object({ field: z.string().nullable() })",
    "fast": "true",  # or "false"
    "customContext": "Document contains technical jargon specific to aerospace industry"
}

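# Authenticate with the X-API-KEY header (replace YOUR_KEY with your API key).
headers = {"X-API-KEY": "YOUR_KEY"}

response = requests.post("https://api.trycardinal.ai/extract", headers=headers, data=data)
response.raise_for_status()  # raise on HTTP errors before parsing the body
result = response.json()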

# `response` may be an object or a stringified JSON depending on config/model:
if isinstance(result["response"], str):
    extracted = json.loads(result["response"])
else:
    extracted = result["response"]

cURL

curl -X POST https://api.trycardinal.ai/extract \
  -H "x-api-key: YOUR_KEY" \
  -F fileUrl="https://example.com/doc.pdf" \
  -F schema='{"$schema":"https://json-schema.org/draft/2020-12/schema","type":"object","properties":{"field":{"type":"string"}}}' \
  -F fast=true \
  -F customContext="Pay special attention to dates and numerical values"

Response Format

/extract returns:
  • response – the extracted data (either an object or a JSON string)
  • method – “fast” or “slow” (“slow” denotes the standard path, fast: false)
  • pages_processed – present in slow mode
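  • confidence_score – confidence score indicating extraction reliability, present in slow mode (documented as 0-100, although the standard-mode example response shows a fractional value, 0.95; confirm the scale before applying a threshold)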
  • success – present in slow mode

If response is a string, parse it:

import json

if isinstance(result["response"], str):
    extracted = json.loads(result["response"])
else:
    extracted = result["response"]
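
The fields listed above can be normalized into a single structure so that callers do not have to care whether the request ran in fast or slow mode. This is a minimal sketch: ExtractResult and parse_extract_result are hypothetical names, and fields that are only present in slow mode are simply left as None when absent.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ExtractResult:
    extracted: Any
    method: Optional[str] = None
    pages_processed: Optional[int] = None      # slow mode only
    confidence_score: Optional[float] = None   # slow mode only
    success: Optional[bool] = None             # slow mode only


def parse_extract_result(result: dict) -> ExtractResult:
    """Normalize an /extract response body into an ExtractResult."""
    raw = result["response"]
    # `response` may be a JSON string or an already-parsed object.
    extracted = json.loads(raw) if isinstance(raw, str) else raw
    return ExtractResult(
        extracted=extracted,
        method=result.get("method"),
        pages_processed=result.get("pages_processed"),
        confidence_score=result.get("confidence_score"),
        success=result.get("success"),
    )


fast = parse_extract_result({"response": '{"is_blurry": false}', "method": "fast"})
print(fast.extracted, fast.method, fast.pages_processed)
```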

Long Document Caveats

Long documents, especially those that populate large arrays in the schema, may return truncated arrays or partial results.

Tips

  • Use fast: false for complex, multi-page documents
  • Paginate large PDFs and run /extract per chunk, then merge arrays client-side
  • Leverage customContext to provide domain-specific guidance for better extraction accuracy
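
The client-side merge mentioned in the tips could look roughly like the sketch below. It assumes each chunk was extracted with the same schema, that array fields should be concatenated in chunk order, and that scalar fields missing from a chunk come back as null; merge_chunk_results is a hypothetical helper, and splitting the PDF into chunks is left out.

```python
def merge_chunk_results(chunks):
    """Merge per-chunk extraction dicts into one result.

    Lists are concatenated in chunk order; for any other value the
    first non-null occurrence wins.
    """
    merged = {}
    for chunk in chunks:
        for key, value in chunk.items():
            if isinstance(value, list):
                existing = merged.get(key)
                merged[key] = (existing if isinstance(existing, list) else []) + value
            elif merged.get(key) is None:
                merged[key] = value
    return merged


chunks = [
    {
        "invoice_number": "INV-00392",
        "total_amount": None,
        "line_items": [{"description": "Consulting", "amount": 900.0}],
    },
    {
        "invoice_number": None,
        "total_amount": 1042.5,
        "line_items": [{"description": "Tax", "amount": 142.5}],
    },
]
merged = merge_chunk_results(chunks)
print(merged)
```

Taking the first non-null scalar is a deliberate simplification; if chunks can disagree on a scalar value, a stricter merge that reports conflicts may be a better fit.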

API Reference

Extract API