TL;DR

Send an array of descriptions of the pages you want (e.g., “invoices”, “contracts”, “account number xyz”). Cardinal returns the page numbers that match each description. Note: to keep results accurate, each page is assigned to at most one partition.

Endpoint

POST https://api.trycardinal.ai/split
Content-Type: multipart/form-data
Auth: X-API-KEY: <API_KEY>
You may provide either file or fileUrl.

Required parameters

  • queries (string) — JSON-encoded array of query objects (see format below)
  • file (file upload) OR fileUrl (string)

Optional parameters

  • customContext (string) — Background on the document type, industry, or domain-specific terminology. The model uses it to classify pages more accurately.
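The parameters above can also be sent without uploading a file. The following is a minimal sketch, assuming the server accepts fileUrl and customContext as plain text fields of the multipart form. The helper names build_split_fields and split_by_url, the example URL, and the timeout value are illustrative and not part of the API. Passing (None, value) tuples through files= is how requests encodes text fields as multipart parts when no file is attached.

```python
import json


def build_split_fields(queries, file_url=None, custom_context=None):
    """Build the text fields of a /split request as multipart parts.

    Each value is a (None, text) tuple, which `requests` encodes as a
    plain multipart field (no filename) when passed through `files=`.
    """
    fields = {"queries": (None, json.dumps(queries))}
    if file_url is not None:
        fields["fileUrl"] = (None, file_url)
    if custom_context:
        fields["customContext"] = (None, custom_context)
    return fields


def split_by_url(api_key, file_url, queries, custom_context=None):
    """POST to /split with fileUrl instead of an uploaded file (hypothetical helper)."""
    import requests  # third-party; imported here so the builder above stays dependency-free

    resp = requests.post(
        "https://api.trycardinal.ai/split",
        headers={"X-API-KEY": api_key},
        files=build_split_fields(queries, file_url, custom_context),
        timeout=120,  # assumed value; tune for large documents
    )
    resp.raise_for_status()
    return resp.json()
```

A call might look like split_by_url("YOUR_API_KEY", "https://example.com/report.pdf", queries, custom_context="Quarterly filings for a retail bank").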

Query object format

Each query is:
  • name (string) — label you’ll see in the response (e.g., "invoices")
  • description (string, optional) — natural-language hint used to find relevant pages
Example queries value
[
  {"name":"invoices","description":"Pages with invoice numbers, totals due, remittance sections"},
  {"name":"contracts","description":"Legal agreements with parties, terms, signatures"},
  {"name":"financial_statements","description":"Balance sheets, income statements, cash flow tables"}
]
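Because queries travels as a JSON-encoded string, it can help to validate the list before encoding it. The sketch below uses a hypothetical helper, encode_queries. The duplicate-name check is a client-side choice made here because results come back labeled by name. It is not a documented API rule.

```python
import json


def encode_queries(queries):
    """Validate query objects client-side and return the JSON string for `queries`."""
    seen = set()
    cleaned = []
    for q in queries:
        name = (q.get("name") or "").strip()
        if not name:
            raise ValueError(f"query is missing a name: {q!r}")
        if name in seen:
            # Assumption: duplicate names would make the response ambiguous
            raise ValueError(f"duplicate query name: {name!r}")
        seen.add(name)
        item = {"name": name}
        description = (q.get("description") or "").strip()
        if description:  # description is optional, so omit it when empty
            item["description"] = description
        cleaned.append(item)
    return json.dumps(cleaned)
```

The returned string can be passed directly as the queries form field.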

Example requests

import json

import requests

queries = [
    {"name": "cover_pages", "description": "Title pages or covers"},
    {"name": "data_tables", "description": "Pages with structured tables"},
    {"name": "appendices", "description": "Supplemental materials or references"}
]

# Open the file in a context manager so the handle is closed after the upload
with open("quarterly-report.pdf", "rb") as f:
    resp = requests.post(
        "https://api.trycardinal.ai/split",
        headers={"X-API-KEY": "YOUR_API_KEY"},  # auth header from the Endpoint section
        files={"file": f},
        data={"queries": json.dumps(queries)},  # queries must be a JSON-encoded string
    )
resp.raise_for_status()  # stop here on an HTTP error status
print(resp.json())

Example Response

{
  "success": true,
  "pages": [
    {
      "content": "Page 1 content...",
      "page_number": 1
    },
    {
      "content": "Page 2 content...",
      "page_number": 2
    }
  ],
  "partitions": [
    {
      "name": "cover_pages",
      "description": "Title pages or covers",
      "pages": [1]
    },
    {
      "name": "data_tables",
      "description": "Pages with structured tables",
      "pages": [3, 4, 7, 8]
    },
    {
      "name": "appendices",
      "description": "Supplemental materials or references",
      "pages": [9, 10, 11]
    }
  ]
}

Response Format

The response includes:
  • success - Boolean indicating if partitioning completed successfully
  • pages - Array of page objects, each with content and page_number
  • partitions - Array of partition results, each containing:
    • name - The partition name from your query
    • description - The partition description from your query
    • pages - Array of page numbers that match this partition (sorted)
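The fields listed above can be reshaped into a lookup from partition name to page numbers. The sketch below assumes the response has already been parsed into a dict with the structure shown in the example response. The function names pages_by_partition and unmatched_pages are illustrative. Since each page belongs to at most one partition, some pages may match no query, and the second helper lists them.

```python
def pages_by_partition(result):
    """Map each partition name to its list of matching page numbers."""
    if not result.get("success"):
        raise RuntimeError("split did not complete successfully")
    return {p["name"]: list(p["pages"]) for p in result["partitions"]}


def unmatched_pages(result):
    """Return page numbers that were not assigned to any partition."""
    matched = {n for p in result["partitions"] for n in p["pages"]}
    return sorted(
        page["page_number"]
        for page in result["pages"]
        if page["page_number"] not in matched
    )


# Small hand-written response used only for illustration
sample = {
    "success": True,
    "pages": [{"content": "...", "page_number": n} for n in (1, 2, 3, 4)],
    "partitions": [
        {"name": "cover_pages", "description": "Title pages or covers", "pages": [1]},
        {"name": "data_tables", "description": "Pages with structured tables", "pages": [3, 4]},
    ],
}

print(pages_by_partition(sample))  # {'cover_pages': [1], 'data_tables': [3, 4]}
print(unmatched_pages(sample))     # [2]
```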

Supported File Types

This endpoint supports:
  • PDF files (.pdf)
  • Images (.jpg, .jpeg, .png)
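A client can check the extension before uploading to avoid a wasted request. This is a minimal sketch that looks only at the file name, not the file contents. The name is_supported is hypothetical, and the server may apply its own validation.

```python
from pathlib import Path

# Extensions listed in the Supported File Types section
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def is_supported(path):
    """Return True if the file name ends in a supported extension (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```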

Writing Effective Queries

Good query descriptions:
  • Be specific about content type: “Financial tables with revenue data”
  • Include context clues: “Pages with signatures or sign-off sections”
  • Mention visual indicators: “Charts, graphs, or data visualizations”
Less effective queries:
  • Too vague: “Important pages”
  • Overly restrictive: “Page 5 specifically about Q3 sales in the northeast region”

Under the Hood

We first run the document through our full Markdown pipeline, converting it into a text representation, and only then split it. Splitting on the converted text keeps the results consistent.

Extraction Endpoint

POST https://api.trycardinal.ai/split/extract
Content-Type: multipart/form-data
Auth: X-API-KEY: <API_KEY>
After classifying pages with /split, use /split/extract to download specific pages as a separate PDF.

Required parameters

  • pages (string) — JSON-encoded array of page numbers (e.g., "[1, 3, 5, 7]")
  • file (file upload) OR fileUrl (string)

Example requests

import json

import requests

API_KEY = "YOUR_API_KEY"

# First, get page classifications from /split
with open("document.pdf", "rb") as f:
    split_resp = requests.post(
        "https://api.trycardinal.ai/split",
        headers={"X-API-KEY": API_KEY},
        files={"file": f},
        data={"queries": json.dumps([
            {"name": "invoices", "description": "Invoice pages"}
        ])},
    )
split_resp.raise_for_status()

partitions = split_resp.json()["partitions"]
# Default to an empty list so a missing partition does not raise StopIteration
invoice_pages = next((p["pages"] for p in partitions if p["name"] == "invoices"), [])

# Extract those pages as a new PDF (skipped when nothing matched)
if invoice_pages:
    with open("document.pdf", "rb") as f:
        extract_resp = requests.post(
            "https://api.trycardinal.ai/split/extract",
            headers={"X-API-KEY": API_KEY},
            files={"file": f},
            data={
                "pages": json.dumps(invoice_pages),  # e.g., [1, 3, 5]
                "filename": "invoices.pdf",
            },
        )
    extract_resp.raise_for_status()

    # Save the extracted PDF
    with open("invoices.pdf", "wb") as out:
        out.write(extract_resp.content)

Response Format

The /split/extract endpoint returns a PDF file directly.
  • Content-Type: application/pdf
  • Content-Disposition: attachment; filename="<your-filename>.pdf"
To save the extracted PDF, write the response body to a .pdf file. In the example above, <your-filename> corresponds to the filename field sent with the request.
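To save every partition as its own PDF, the /split results can be turned into one /split/extract request per partition. The sketch below is illustrative: extract_requests and save_partitions are hypothetical helpers, the filename field mirrors the one used in the example request, and skipping partitions with no pages is an assumption about how empty matches should be handled.

```python
import json


def extract_requests(partitions):
    """Return one set of /split/extract form fields per non-empty partition."""
    jobs = []
    for part in partitions:
        if not part["pages"]:
            continue  # nothing matched this query, so there is nothing to extract
        jobs.append({
            "pages": json.dumps(part["pages"]),
            "filename": f"{part['name']}.pdf",
        })
    return jobs


def save_partitions(api_key, pdf_path, partitions):
    """Download each partition as a separate PDF in the current directory."""
    import requests  # third-party; imported here so extract_requests has no dependencies

    for fields in extract_requests(partitions):
        with open(pdf_path, "rb") as f:
            resp = requests.post(
                "https://api.trycardinal.ai/split/extract",
                headers={"X-API-KEY": api_key},
                files={"file": f},
                data=fields,
                timeout=120,  # assumed value
            )
        resp.raise_for_status()
        with open(fields["filename"], "wb") as out:
            out.write(resp.content)
```

This re-uploads the source file once per partition. Whether fileUrl would be a cheaper option here depends on where the document is hosted.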
