Dense PDFs

Overview

Cardinal detects and processes dense PDFs: documents that contain many small-text elements that are difficult to parse cleanly with normal extraction. When dense PDF detection is enabled, Cardinal splits the PDF into its individual table elements and extracts the remaining text outside those tables. Results are returned as structured elements under the processed_tables field, along with the extracted content from non-table areas.

For each table element, you'll receive:

table_index – index of the table on the page.
page_number – page where the table appears.
bounding_box – normalized geometry including min_x, min_y, max_x, max_y, plus row/column counts and polygon coordinates.
row_count / column_count – detected table dimensions.
image_format – output crop format (e.g. png).
crop_coordinates – pixel coordinates for cropped table images.
dpi_x / dpi_y – resolution used for rendering.
text_content – extracted HTML/Markdown table text.
html_code – cleaned HTML version of the table (when available).

Example: Detected dense PDF and accompanying content, as returned by the API

How to Enable

To enable dense PDF detection, set densePdfDetect: true in your API request.

Default: densePdfDetect = false
⚠️ Enabling dense PDF detection will add latency to your requests, since additional detection passes are run.
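
A minimal sketch of where the flag sits in a request, assuming the API accepts a JSON-style set of options. Only densePdfDetect comes from this page; the fileUrl field and the build_request_options helper are hypothetical placeholders, so consult the API reference for the actual endpoint and field names.

```python
import json


def build_request_options(file_url, dense_pdf_detect=False):
    """Assemble request options; densePdfDetect defaults to False, as documented."""
    return {
        "fileUrl": file_url,  # hypothetical field name, not taken from this page
        "densePdfDetect": dense_pdf_detect,
    }


# Opt in explicitly, since dense PDF detection is off by default.
options = build_request_options("https://example.com/drawing.pdf", dense_pdf_detect=True)
print(json.dumps(options, indent=2))
```

Keeping the flag as an explicit argument makes the latency trade-off a per-request decision rather than a global setting.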

Example Response (excerpt)

{
  "processed_tables": [
    {
      "table_index": 0,
      "page_number": 1,
      "bounding_box": {
        "page_number": 1,
        "min_x": 0.284,
        "min_y": 0.8275,
        "max_x": 3.6697,
        "max_y": 2.4552,
        "row_count": 26,
        "column_count": 10,
        "polygon": [0.284, 0.8275, 3.6687, 0.8282, 3.6697, 2.4552, 0.2852, 2.4547]
      },
      "row_count": 26,
      "column_count": 10,
      "image_format": "png",
      "crop_coordinates": {
        "left": 85,
        "top": 248,
        "right": 1100,
        "bottom": 736,
        "width": 1015,
        "height": 488
      },
      "dpi_x": 300.0,
      "dpi_y": 300.0,
      "text_content": "<table> ... </table>",
      "html_code": ""
    }
  ]
}
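
One way to consume a response like the excerpt above is sketched below. It assumes the polygon is a flat list of alternating x and y values (the excerpt has eight numbers, which fits four corners) and that html_code may be empty, in which case it falls back to text_content. The summarize_tables helper is illustrative and not part of any Cardinal SDK.

```python
import json

# Trimmed copy of the excerpt above; a real response would come from the API.
response_text = """
{
  "processed_tables": [
    {
      "table_index": 0,
      "page_number": 1,
      "bounding_box": {
        "polygon": [0.284, 0.8275, 3.6687, 0.8282, 3.6697, 2.4552, 0.2852, 2.4547]
      },
      "row_count": 26,
      "column_count": 10,
      "crop_coordinates": {"left": 85, "top": 248, "right": 1100, "bottom": 736,
                           "width": 1015, "height": 488},
      "text_content": "<table> ... </table>",
      "html_code": ""
    }
  ]
}
"""


def summarize_tables(response):
    """Return one small summary dict per entry in processed_tables."""
    summaries = []
    for table in response.get("processed_tables", []):
        flat = table["bounding_box"]["polygon"]
        # Assumption: the polygon alternates x, y, x, y, ...
        corners = list(zip(flat[0::2], flat[1::2]))
        crop = table["crop_coordinates"]
        summaries.append({
            "table_index": table["table_index"],
            "page_number": table["page_number"],
            "shape": (table["row_count"], table["column_count"]),
            "corners": corners,
            "crop_size": (crop["right"] - crop["left"], crop["bottom"] - crop["top"]),
            # html_code is only filled "when available", so fall back to text_content.
            "markup": table.get("html_code") or table["text_content"],
        })
    return summaries


summaries = summarize_tables(json.loads(response_text))
for summary in summaries:
    print(summary["page_number"], summary["table_index"], summary["shape"], summary["crop_size"])
```

In this excerpt the crop coordinates are consistent with the bounding box values multiplied by the DPI and rounded down, though this page does not state that relationship.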

Why Dense PDF Detection?

Some engineering, financial, and architectural documents contain dozens of tiny tables (e.g., schedules, parts lists, cell plans) that break normal Markdown extraction. Dense PDF detection solves this by:

Isolating each table element: Splits the PDF into individual table components with precise bounding boxes
Extracting non-table content: Captures all remaining text outside of the identified table areas
Providing structured output: Delivers both cropped table images for visual reference and extracted HTML/Markdown text for structured parsing

Dense PDF extraction requires the densePdfDetect parameter to be enabled in your API request. Results appear in the JSON output under processed_tables along with extracted content from non-table areas.
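
For structured parsing, the HTML in text_content or html_code can be turned into rows of cell text. The sketch below uses only Python's standard html.parser module; the sample table is made up for illustration, and the parser ignores nested tables and colspan/rowspan attributes, so treat it as a starting point under those assumptions.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each td/th cell, grouped by tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_rows(html):
    """Parse an HTML table string into a list of rows of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical table markup standing in for a real text_content value.
sample = (
    "<table><tr><th>Part</th><th>Qty</th></tr>"
    "<tr><td>Bolt M8</td><td>12</td></tr></table>"
)
rows = table_to_rows(sample)
print(rows)
```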
