Skip to content

JSON Schema

The --format json output uses a flat kids array:

{
"file name": "document.pdf",
"number of pages": 12,
"author": "string | null",
"title": "string | null",
"creation date": "string | null",
"modification date": "string | null",
"kids": [Element]
}

Each element in kids has a type field identifying its kind:

FieldTypeDescription
typestring"heading", "paragraph", "table", "list", "image", "caption", "header", "footer"
idintGlobally unique sequential ID
page numberint1-based page index
bounding box[left, bottom, right, top]Coordinates in PDF points
contentstringExtracted text
fontstringFont name
font sizefloatFont size in points
FieldTypeDescription
levelstring"title", "section", "subsection"
heading levelintNumeric heading level (1–6)
FieldTypeDescription
rowsRow[]Array of row objects
FieldTypeDescription
row numberint0-based row index
cellsCell[]Array of cell objects
FieldTypeDescription
colint0-based column index
contentstringCell text
is headerbool?Whether in a header row
row spanint?Rows spanned (default: 1)
col spanint?Columns spanned (default: 1)
{
"file name": "report.pdf",
"number of pages": 1,
"author": null,
"title": "Quarterly Report",
"creation date": "2024-01-15",
"modification date": null,
"kids": [
{
"type": "heading",
"id": 1,
"level": "title",
"heading level": 1,
"page number": 1,
"bounding box": [72, 90, 540, 120],
"font": "Helvetica-Bold",
"font size": 24.0,
"content": "Quarterly Report"
},
{
"type": "paragraph",
"id": 2,
"page number": 1,
"bounding box": [72, 140, 540, 200],
"content": "Revenue grew 23% year-over-year."
}
]
}