JSON Schema
Top-Level Structure
The --format json output uses a flat kids array:

```json
{
  "file name": "document.pdf",
  "number of pages": 12,
  "author": "string | null",
  "title": "string | null",
  "creation date": "string | null",
  "modification date": "string | null",
  "kids": [Element]
}
```

Element Object
Each element in kids has a type field identifying its kind:
| Field | Type | Description |
|---|---|---|
| type | string | One of "heading", "paragraph", "table", "list", "image", "caption", "header", "footer" |
| id | int | Globally unique sequential ID |
| page number | int | 1-based page index |
| bounding box | [left, bottom, right, top] | Coordinates in PDF points |
| content | string | Extracted text (present on paragraph, heading, caption, list item; absent on table, image, list, header, footer) |
| font | string | Font name (text elements only) |
| font size | float | Font size in points (text elements only) |
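
The following is a minimal sketch of how a consumer might walk the kids array, assuming the output has already been parsed into a Python dict (for example with json.load). The helper names bbox_size and text_by_page and the sample values are illustrative assumptions, not part of the tool.

```python
from collections import defaultdict


def bbox_size(bbox):
    """Return (width, height) in PDF points for a [left, bottom, right, top] box."""
    left, bottom, right, top = bbox
    return right - left, top - bottom


def text_by_page(doc):
    """Group extracted text by 1-based page number, in document order."""
    pages = defaultdict(list)
    for element in doc.get("kids", []):
        # "content" is absent on tables, images, lists, headers and footers,
        # so those elements are skipped here.
        text = element.get("content")
        if text is not None:
            pages[element["page number"]].append(text)
    return dict(pages)


# Hypothetical parsed output, using the field names from the table above.
doc = {
    "kids": [
        {"type": "heading", "id": 1, "page number": 1,
         "bounding box": [72, 90, 540, 120], "content": "Quarterly Report"},
        {"type": "image", "id": 2, "page number": 1,
         "bounding box": [72, 130, 300, 260]},
        {"type": "paragraph", "id": 3, "page number": 2,
         "bounding box": [72, 140, 540, 200], "content": "Revenue grew."},
    ]
}

print(text_by_page(doc))
print(bbox_size(doc["kids"][0]["bounding box"]))
```

Using element.get("content") rather than indexing directly keeps the loop safe for element types that carry no text.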
Heading-Specific Fields
| Field | Type | Description |
|---|---|---|
| level | string | Semantic label: "Title", "Subtitle", "Heading1", "Heading2", "Heading3", "Heading4" |
| heading level | int | Numeric heading level (1–6) |
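
As an illustration of the numeric heading level field, the sketch below builds an indented outline from the heading elements of a parsed document. The function name outline and the sample data are assumptions made for this example; only the field names come from the tables above.

```python
def outline(doc, indent="  "):
    """Return one indented line per heading element, in document order."""
    lines = []
    for element in doc.get("kids", []):
        if element.get("type") != "heading":
            continue
        # "heading level" is numeric (1-6); level 1 gets no indentation.
        depth = max(int(element.get("heading level", 1)), 1) - 1
        lines.append(indent * depth + element.get("content", ""))
    return lines


# Hypothetical parsed output with three headings and one paragraph.
doc = {
    "kids": [
        {"type": "heading", "id": 1, "heading level": 1, "content": "Quarterly Report"},
        {"type": "paragraph", "id": 2, "content": "Revenue grew."},
        {"type": "heading", "id": 3, "heading level": 2, "content": "Revenue"},
        {"type": "heading", "id": 4, "heading level": 3, "content": "By Region"},
    ]
}

print("\n".join(outline(doc)))
```

The sketch relies on heading level rather than the string level label, since the numeric field can be used for indentation without a lookup table.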
Table-Specific Fields
| Field | Type | Description |
|---|---|---|
| rows | Row[] | Array of row objects |
Row Object
| Field | Type | Description |
|---|---|---|
| row number | int | 0-based row index |
| cells | Cell[] | Array of cell objects |
Cell Object
| Field | Type | Description |
|---|---|---|
| type | string | Always "table cell" |
| row number | int | 1-based row index |
| column number | int | 1-based column index |
| row span | int | Number of rows spanned (default: 1) |
| column span | int | Number of columns spanned (default: 1) |
| kids | Element[] | Nested child elements (typically empty) |
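
The row, cell, and span fields can be combined to rebuild a rectangular grid. The sketch below is one possible approach, not the tool's own method. It assumes that cell text, when present, lives in the content of the elements nested under a cell's kids, and it uses only the cell-level 1-based indices. The names cell_text and table_to_grid and the sample table are hypothetical.

```python
def cell_text(cell):
    """Join the text of a cell's nested elements (assumption: text lives in kids)."""
    return " ".join(k["content"] for k in cell.get("kids", []) if "content" in k)


def table_to_grid(table):
    """Expand a table element into a list of rows, repeating text across spans."""
    placed = {}
    n_rows = n_cols = 0
    for row in table.get("rows", []):
        for cell in row.get("cells", []):
            # Cell indices are 1-based; convert to 0-based grid positions.
            r0 = cell["row number"] - 1
            c0 = cell["column number"] - 1
            row_span = cell.get("row span", 1)
            col_span = cell.get("column span", 1)
            text = cell_text(cell)
            for r in range(r0, r0 + row_span):
                for c in range(c0, c0 + col_span):
                    placed[(r, c)] = text
            n_rows = max(n_rows, r0 + row_span)
            n_cols = max(n_cols, c0 + col_span)
    # Positions never covered by a cell become empty strings.
    return [[placed.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical table: a header cell spanning two columns, then two body cells.
table = {
    "type": "table",
    "rows": [
        {"cells": [
            {"type": "table cell", "row number": 1, "column number": 1,
             "row span": 1, "column span": 2,
             "kids": [{"type": "paragraph", "content": "Region"}]},
        ]},
        {"cells": [
            {"type": "table cell", "row number": 2, "column number": 1,
             "row span": 1, "column span": 1,
             "kids": [{"type": "paragraph", "content": "EMEA"}]},
            {"type": "table cell", "row number": 2, "column number": 2,
             "row span": 1, "column span": 1,
             "kids": [{"type": "paragraph", "content": "APAC"}]},
        ]},
    ],
}

print(table_to_grid(table))
```

Repeating the text of a spanning cell in every covered position is a design choice that keeps the grid rectangular; a consumer that needs to distinguish merged cells would have to track spans separately.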
Full Example
```json
{
  "file name": "report.pdf",
  "number of pages": 1,
  "author": null,
  "title": "Quarterly Report",
  "creation date": "2024-01-15",
  "modification date": null,
  "kids": [
    {
      "type": "heading",
      "id": 1,
      "level": "Title",
      "heading level": 1,
      "page number": 1,
      "bounding box": [72, 90, 540, 120],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Quarterly Report"
    },
    {
      "type": "paragraph",
      "id": 2,
      "page number": 1,
      "bounding box": [72, 140, 540, 200],
      "content": "Revenue grew 23% year-over-year."
    }
  ]
}
```
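
A document shaped like this example can be sanity-checked against the field descriptions above. The sketch below is a hedged illustration rather than an official validator. The function check_document and the KNOWN_TYPES set are assumptions drawn from the element table, and real output may contain types or fields not listed here.

```python
import json

# Element types listed in the Element Object table (assumed to be exhaustive here).
KNOWN_TYPES = {"heading", "paragraph", "table", "list",
               "image", "caption", "header", "footer"}


def check_document(doc):
    """Return a list of human-readable problems found in the top-level kids."""
    problems = []
    seen_ids = set()
    page_count = doc.get("number of pages", 0)
    for element in doc.get("kids", []):
        element_id = element.get("id")
        label = f"element {element_id}"
        if element.get("type") not in KNOWN_TYPES:
            problems.append(f"{label}: unknown type {element.get('type')!r}")
        if element_id in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(element_id)
        # Page numbers are 1-based and must not exceed the page count.
        page = element.get("page number")
        if not isinstance(page, int) or not 1 <= page <= page_count:
            problems.append(f"{label}: page number {page!r} out of range")
        # Bounding boxes are [left, bottom, right, top].
        box = element.get("bounding box")
        if not (isinstance(box, list) and len(box) == 4
                and box[0] <= box[2] and box[1] <= box[3]):
            problems.append(f"{label}: malformed bounding box {box!r}")
    return problems


# A compact copy of the example document, parsed from JSON text.
report = json.loads("""
{"file name": "report.pdf", "number of pages": 1, "kids": [
  {"type": "heading", "id": 1, "page number": 1,
   "bounding box": [72, 90, 540, 120], "content": "Quarterly Report"},
  {"type": "paragraph", "id": 2, "page number": 1,
   "bounding box": [72, 140, 540, 200],
   "content": "Revenue grew 23% year-over-year."}
]}
""")

print(check_document(report))
```

The checks cover only top-level elements; nested table cells would need a recursive walk over rows, cells, and kids.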