EdgeParse is a high-performance PDF-to-structured-data extraction engine written in Rust. It converts complex PDFs into clean, structured JSON, Markdown, or HTML in milliseconds without ML dependencies.

How fast is EdgeParse compared to other PDF parsers?

EdgeParse processes 40+ pages per second — 10 to 100× faster than Python-based alternatives like Docling or Marker. It achieves 0.026s average processing time per document.

What programming languages does EdgeParse support?

EdgeParse provides native bindings for Python (via PyO3) and Node.js (via NAPI-RS), ships a standalone CLI binary, and can be used directly as a Rust library crate.

Does EdgeParse require GPU or ML models?

No. EdgeParse is a rule-based extraction engine with zero ML dependencies. No GPU, no Java, no Poppler, no Tesseract required. Just pip install edgeparse and go.

Reading Order

The Problem

PDF documents don’t store text in reading order. A two-column academic paper has text positioned absolutely on the page — the PDF viewer renders it visually, but the underlying data is a jumble of coordinates.
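To make the problem concrete, here is a minimal Python sketch. The fragment names, coordinates, and the `gutter_x` parameter are hypothetical and are not part of EdgeParse's API. It shows how a plain top-to-bottom, left-to-right sort interleaves the two columns, while a sort that knows where the column gutter lies does not.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page, in points


# Hypothetical fragments from a two-column page, in arbitrary storage order.
# "A*" belongs to the left column, "B*" to the right column.
fragments = [
    Fragment("B1", 320, 100),
    Fragment("A2", 50, 114),
    Fragment("A1", 50, 100),
    Fragment("B2", 320, 114),
]


def naive_order(frags: List[Fragment]) -> List[str]:
    """Sort top-to-bottom, then left-to-right, ignoring columns."""
    return [f.text for f in sorted(frags, key=lambda f: (f.y, f.x))]


def column_aware_order(frags: List[Fragment], gutter_x: float) -> List[str]:
    """Read everything left of the gutter first, then everything right of it."""
    left = sorted((f for f in frags if f.x < gutter_x), key=lambda f: f.y)
    right = sorted((f for f in frags if f.x >= gutter_x), key=lambda f: f.y)
    return [f.text for f in left + right]


print(naive_order(fragments))              # ['A1', 'B1', 'A2', 'B2']
print(column_aware_order(fragments, 300))  # ['A1', 'A2', 'B1', 'B2']
```

The second function only works because the gutter position is passed in by hand. Finding that position automatically is the job of the layout analysis.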

XY-Cut++ Algorithm

EdgeParse uses an enhanced XY-Cut algorithm to determine reading order:

Recursive splitting: divide the page into horizontal and vertical strips
Column detection: identify column boundaries from whitespace gaps
Block ordering: sort blocks top-to-bottom within each column, and columns left-to-right
Cross-column flow: handle elements that span multiple columns, such as titles and footers

How It Works

Page Layout          XY-Cut Analysis       Reading Order
┌─────────┐         ┌─────────┐          1. Title (span)
│ Title   │         │█████████│          2. Col 1, Para 1
├────┬────┤  ──▶    ├──▶─┬──▶─┤   ──▶    3. Col 1, Para 2
│Par1│Par3│         │ 1  │ 3  │          4. Col 2, Para 3
│Par2│Par4│         │ 2  │ 4  │          5. Col 2, Para 4
├────┴────┤         ├────┴────┤          6. Footer (span)
│ Footer  │         │█████████│
└─────────┘         └─────────┘
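
The recursion can be illustrated with a small Python sketch of a classic XY-cut. This is not EdgeParse's implementation and does not cover the XY-Cut++ enhancements. The `Block` type, the coordinates, and the "cut at the widest whitespace gap" rule are assumptions made for illustration. The sketch also assumes that the column gutter is wider than the spacing between paragraphs, which real pages do not always guarantee.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    name: str
    x0: float  # left
    y0: float  # top (y grows downward)
    x1: float  # right
    y1: float  # bottom


def widest_gap(intervals: List[Tuple[float, float]]) -> Optional[Tuple[float, float]]:
    """Project intervals onto one axis and return (gap_width, cut_position)
    for the widest uncovered gap between them, or None if there is no gap."""
    best = None
    reach = None  # furthest coordinate covered so far
    for start, end in sorted(intervals):
        if reach is not None and start > reach:
            gap = start - reach
            if best is None or gap > best[0]:
                best = (gap, (start + reach) / 2)
        reach = end if reach is None else max(reach, end)
    return best


def xy_cut(blocks: List[Block]) -> List[Block]:
    """Recursively split at the widest whitespace gap and concatenate the halves."""
    if len(blocks) <= 1:
        return list(blocks)
    x_gap = widest_gap([(b.x0, b.x1) for b in blocks])
    y_gap = widest_gap([(b.y0, b.y1) for b in blocks])
    if x_gap is None and y_gap is None:
        # No clean cut exists: fall back to a plain positional sort.
        return sorted(blocks, key=lambda b: (b.y0, b.x0))
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]  # vertical cut: left part is read before right part
        first = [b for b in blocks if b.x1 <= cut]
        second = [b for b in blocks if b.x0 >= cut]
    else:
        cut = y_gap[1]  # horizontal cut: top part is read before bottom part
        first = [b for b in blocks if b.y1 <= cut]
        second = [b for b in blocks if b.y0 >= cut]
    return xy_cut(first) + xy_cut(second)


# Hypothetical page: full-width title, two columns, full-width footer,
# listed in arbitrary order.
page = [
    Block("Par4", 105, 65, 200, 95),
    Block("Footer", 0, 105, 200, 120),
    Block("Par1", 0, 30, 95, 60),
    Block("Title", 0, 0, 200, 20),
    Block("Par3", 105, 30, 200, 60),
    Block("Par2", 0, 65, 95, 95),
]
reading_order = [b.name for b in xy_cut(page)]
print(reading_order)  # ['Title', 'Par1', 'Par2', 'Par3', 'Par4', 'Footer']
```

The full-width title and footer block any vertical cut at the top level, so they are peeled off by horizontal cuts first. Only then does the gutter between the columns become visible as a vertical gap.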

Benchmark

EdgeParse achieves an NID score of 0.911 on 200 diverse documents, within 0.001 of the highest reading order accuracy among the benchmarked tools (OpenDataLoader, 0.912).

Tool	NID Score
EdgeParse	0.911
OpenDataLoader	0.912
Docling	0.899
PyMuPDF4LLM	0.888
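
The text above does not define NID. As a rough guide to reading these scores, the sketch below assumes NID is a normalized insertion/deletion (indel) similarity between the predicted and the reference reading order, where 1.0 means identical order. This definition, the function names, and the use of block labels instead of full text are all assumptions, not the benchmark's actual scoring code.

```python
from typing import Sequence


def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for item in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if item == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(predicted: Sequence, reference: Sequence) -> float:
    """1 - (insertions + deletions needed) / (combined length). Assumed NID definition."""
    total = len(predicted) + len(reference)
    if total == 0:
        return 1.0
    distance = total - 2 * lcs_length(predicted, reference)
    return 1.0 - distance / total


reference = ["Title", "Par1", "Par2", "Par3", "Par4"]
interleaved = ["Title", "Par1", "Par3", "Par2", "Par4"]  # columns read row by row

print(indel_similarity(reference, reference))    # 1.0
print(indel_similarity(interleaved, reference))  # 0.8
```

Under this assumed definition, a single pair of swapped blocks on a five-block page already lowers the score to 0.8.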