Skip to content

Reading Order

PDF documents don’t store text in reading order. A two-column academic paper has text positioned absolutely on the page — the PDF viewer renders it visually, but the underlying data is a jumble of coordinates.

EdgeParse uses an enhanced XY-Cut algorithm to determine reading order:

  1. Recursive splitting — divide the page into horizontal and vertical strips
  2. Column detection — identify column boundaries from whitespace gaps
  3. Block ordering — sort blocks within columns top-to-bottom, left-to-right
  4. Cross-column flow — handle elements that span multiple columns
Page Layout XY-Cut Analysis Reading Order
┌────┬────┐ ┌────┬────┐ 1. Title (span)
│ Title │ │ █████████│ 2. Col 1, Para 1
├────┼────┤ ──▶ ├──▶─┼──▶─┤ ──▶ 3. Col 1, Para 2
│Par1│Par3│ │ 1 │ 3 │ 4. Col 2, Para 3
│Par2│Par4│ │ 2 │ 4 │ 5. Col 2, Para 4
└────┴────┘ └────┴────┘ 6. Footer (span)

EdgeParse achieves a NID score of 0.889 on 200 diverse documents — the highest reading-order accuracy in the current benchmark snapshot.

ToolNID Score
EdgeParse0.889
OpenDataLoader0.873
Docling0.867
PyMuPDF4LLM0.852