JSON Schema

The --format json output uses a flat kids array:

{
  "file name": "document.pdf",
  "number of pages": 12,
  "author": "string | null",
  "title": "string | null",
  "creation date": "string | null",
  "modification date": "string | null",
  "kids": [Element]
}
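
As a minimal sketch of reading this top-level object, the following Python uses only the standard json module. The sample string and the helper name `summarize` are illustrative assumptions, not part of the tool; the keys are the ones listed above.

```python
import json

# Hypothetical sample shaped like the top-level object described above.
SAMPLE = """
{
  "file name": "document.pdf",
  "number of pages": 12,
  "author": null,
  "title": null,
  "creation date": null,
  "modification date": null,
  "kids": []
}
"""


def summarize(doc: dict) -> str:
    """Return a one-line summary of the document-level metadata."""
    # Keys contain spaces, so plain item access is used.
    # Nullable fields arrive as None after json.loads.
    title = doc.get("title") or "(untitled)"
    author = doc.get("author") or "(unknown author)"
    return (
        f'{doc["file name"]}: {title} by {author}, '
        f'{doc["number of pages"]} pages, {len(doc["kids"])} top-level elements'
    )


doc = json.loads(SAMPLE)
print(summarize(doc))
```

Falling back to a placeholder for null metadata keeps downstream formatting code from having to special-case None.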

Each element in kids has a type field identifying its kind, along with the following common fields:

| Field | Type | Description |
| --- | --- | --- |
| type | string | "heading", "paragraph", "table", "list", "image", "caption", "header", "footer" |
| id | int | Globally unique sequential ID |
| page number | int | 1-based page index |
| bounding box | [left, bottom, right, top] | Coordinates in PDF points |
| content | string | Extracted text (present on paragraph, heading, caption, list item; absent on table, image, list, header, footer) |
| font | string | Font name (text elements only) |
| font size | float | Font size in points (text elements only) |
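
The common fields are enough to filter and measure elements. The sketch below is one possible way to do that in Python. The helper names (`bbox_size`, `count_types`, `text_on_page`) and the sample data are hypothetical, and the code assumes that `content` is simply missing on elements that carry no text.

```python
from collections import Counter


def bbox_size(bbox):
    """Return (width, height) in PDF points for [left, bottom, right, top]."""
    left, bottom, right, top = bbox
    return right - left, top - bottom


def count_types(kids):
    """Count elements by their "type" field."""
    return Counter(el["type"] for el in kids)


def text_on_page(kids, page):
    """Collect "content" from elements on a 1-based page.

    Elements without a "content" key (for example images) are skipped.
    """
    return [
        el["content"]
        for el in kids
        if el["page number"] == page and "content" in el
    ]


# Hypothetical elements using only the common fields from the table above.
kids = [
    {"type": "heading", "id": 1, "page number": 1,
     "bounding box": [72, 90, 540, 120], "content": "Quarterly Report"},
    {"type": "image", "id": 2, "page number": 1,
     "bounding box": [72, 300, 300, 500]},
    {"type": "paragraph", "id": 3, "page number": 2,
     "bounding box": [72, 140, 540, 200], "content": "Revenue grew."},
]

print(bbox_size(kids[0]["bounding box"]))
print(count_types(kids))
print(text_on_page(kids, 1))
```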

Heading elements have these additional fields:

| Field | Type | Description |
| --- | --- | --- |
| level | string | Semantic label: "Title", "Subtitle", "Heading1", "Heading2", "Heading3", "Heading4" |
| heading level | int | Numeric heading level (1–6) |
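
The numeric heading level lends itself to building an outline. The following Python is a rough sketch under the assumption that headings appear in kids in reading order. The function name `outline` and the sample elements are hypothetical, and only the fields used here are shown.

```python
def outline(kids):
    """Build an indented outline from heading elements, in array order."""
    lines = []
    for el in kids:
        if el.get("type") != "heading":
            continue
        # Numeric level 1 gets no indent; each deeper level adds two spaces.
        depth = el.get("heading level", 1) - 1
        lines.append("  " * depth + el.get("content", ""))
    return lines


# Hypothetical elements; fields not needed for the outline are omitted.
kids = [
    {"type": "heading", "heading level": 1, "content": "Quarterly Report"},
    {"type": "paragraph", "content": "Revenue grew."},
    {"type": "heading", "heading level": 2, "content": "Revenue"},
    {"type": "heading", "heading level": 3, "content": "By region"},
    {"type": "heading", "heading level": 2, "content": "Outlook"},
]

for line in outline(kids):
    print(line)
```

The sketch uses the numeric field rather than the semantic label because it avoids assuming how labels such as "Subtitle" map onto depths.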

Table elements have this additional field:

| Field | Type | Description |
| --- | --- | --- |
| rows | Row[] | Array of row objects |

Each row object has these fields:

| Field | Type | Description |
| --- | --- | --- |
| row number | int | 1-based row index |
| cells | Cell[] | Array of cell objects |

Each cell object has these fields:

| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "table cell" |
| row number | int | 1-based row index |
| column number | int | 1-based column index |
| row span | int | Number of rows spanned (default: 1) |
| column span | int | Number of columns spanned (default: 1) |
| kids | Element[] | Nested child elements (typically empty) |

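
Because cells carry 1-based positions and spans, a table can be expanded into a rectangular grid. The Python below is a sketch of that idea, not the tool's own logic. `table_to_grid` and `cell_text` are hypothetical helpers, and `cell_text` assumes that any text for a cell lives in the `content` of its nested kids, which the schema above does not state explicitly.

```python
def cell_text(cell):
    """Join the "content" of a cell's child elements (an assumption about
    where cell text is stored; cells with no kids yield an empty string)."""
    return " ".join(k["content"] for k in cell.get("kids", []) if "content" in k)


def table_to_grid(table):
    """Expand a table element into a 2D list of cell objects.

    A spanning cell is repeated in every grid slot it covers.
    Slots that no cell covers stay None.
    """
    cells = [c for row in table.get("rows", []) for c in row.get("cells", [])]
    if not cells:
        return []
    n_rows = max(c["row number"] + c.get("row span", 1) - 1 for c in cells)
    n_cols = max(c["column number"] + c.get("column span", 1) - 1 for c in cells)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        r0 = c["row number"] - 1      # convert 1-based index to 0-based
        c0 = c["column number"] - 1
        for r in range(r0, r0 + c.get("row span", 1)):
            for col in range(c0, c0 + c.get("column span", 1)):
                grid[r][col] = c
    return grid


def make_cell(row, col, text, col_span=1):
    """Build a hypothetical cell whose text sits in a child paragraph."""
    return {
        "type": "table cell", "row number": row, "column number": col,
        "row span": 1, "column span": col_span,
        "kids": [{"type": "paragraph", "content": text}],
    }


# Hypothetical table; row-level fields other than "cells" are omitted.
table = {
    "type": "table",
    "rows": [
        {"cells": [make_cell(1, 1, "Region totals", col_span=2)]},
        {"cells": [make_cell(2, 1, "North"), make_cell(2, 2, "42")]},
    ],
}

grid = table_to_grid(table)
texts = [[cell_text(c) if c else "" for c in row] for row in grid]
print(texts)
```

The grid relies only on the cell-level indices, so it does not depend on how the enclosing row objects are numbered.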
Example output:

{
  "file name": "report.pdf",
  "number of pages": 1,
  "author": null,
  "title": "Quarterly Report",
  "creation date": "2024-01-15",
  "modification date": null,
  "kids": [
    {
      "type": "heading",
      "id": 1,
      "level": "Title",
      "heading level": 1,
      "page number": 1,
      "bounding box": [72, 90, 540, 120],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Quarterly Report"
    },
    {
      "type": "paragraph",
      "id": 2,
      "page number": 1,
      "bounding box": [72, 140, 540, 200],
      "content": "Revenue grew 23% year-over-year."
    }
  ]
}