Metrics Explained

EdgeParse benchmarks use three complementary metrics that each measure a different aspect of extraction quality.

NID

Measures: Reading order and text completeness.

NID compares the extracted plain text against a ground-truth reference using normalized compression distance, reported as a similarity: a score of 1.0 means the extracted text perfectly matches the reference.
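EdgeParse's exact implementation isn't shown here, but the classic normalized compression distance can be sketched with the standard library's zlib; `ncd_similarity` is a hypothetical helper name, not an EdgeParse API.

```python
import zlib


def _csize(data: bytes) -> int:
    # Length of the zlib-compressed representation at max compression.
    return len(zlib.compress(data, 9))


def ncd_similarity(extracted: str, reference: str) -> float:
    """Return 1 - NCD(x, y); higher means closer to the reference."""
    x, y = extracted.encode(), reference.encode()
    cx, cy, cxy = _csize(x), _csize(y), _csize(x + y)
    # NCD: how much extra information the concatenation needs beyond
    # the more compressible of the two inputs, normalized to ~[0, 1].
    ncd = (cxy - min(cx, cy)) / max(cx, cy)
    return 1.0 - ncd
```

Note that identical texts score near, but not exactly, 1.0 because of compressor overhead, which is one reason real benchmarks bucket scores into bands rather than expecting exact values.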

Score        Meaning
0.95+        Excellent — near-perfect text extraction
0.90–0.95    Good — minor differences
0.80–0.90    Fair — some text missing or reordered
< 0.80       Poor — significant extraction errors
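The bands in the NID table can be expressed as a small threshold lookup; `band_for` is a hypothetical helper for illustration, not part of EdgeParse.

```python
def band_for(score: float) -> str:
    # Thresholds taken from the NID score table above.
    if score >= 0.95:
        return "Excellent"
    if score >= 0.90:
        return "Good"
    if score >= 0.80:
        return "Fair"
    return "Poor"
```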

TEDS

Measures: Table structure accuracy.

TEDS computes the tree edit distance between extracted HTML tables and ground-truth tables, normalized to a 0–1 similarity score. It penalizes missing rows, merged cells, and misaligned columns.
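Real TEDS runs a tree edit distance over the table's HTML DOM; that algorithm is too involved to reproduce here, but the effect of missing or reordered rows can be illustrated with a much-simplified flat proxy built on difflib. This is a sketch only, not the actual TEDS computation, and it ignores row/column spans entirely.

```python
from difflib import SequenceMatcher

# A table as a list of rows, each row a tuple of cell texts.
Table = list[tuple[str, ...]]


def table_similarity(pred: Table, truth: Table) -> float:
    """Crude 0-1 proxy: sequence-match the rows of two tables.

    Real TEDS compares full DOM trees (so cell merges and column
    misalignment are penalized too); this flat version only shows
    how missing or reordered rows pull the score below 1.0.
    """
    return SequenceMatcher(None, pred, truth).ratio()
```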

Score        Meaning
0.90+        Excellent — tables nearly identical
0.80–0.90    Good — minor structural differences
0.60–0.80    Fair — some rows/columns misaligned
< 0.60       Poor — significant structural errors

MHS

Measures: Document structure / heading detection accuracy.

MHS compares the heading hierarchy (H1–H6) in the extracted output against the ground truth. It rewards correct heading levels and penalizes missing or incorrectly leveled headings.
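EdgeParse's exact MHS formula isn't given here, but the idea can be sketched as: of the ground-truth headings, what fraction were extracted at the correct level? `heading_similarity` is a hypothetical name, and this toy version matches headings by text, ignoring ordering and spurious extra headings, both of which a real scorer would also penalize.

```python
def heading_similarity(pred, truth):
    """Fraction of ground-truth headings extracted at the correct level.

    `pred` and `truth` are lists of (level, text) pairs, e.g. (2, "Methods").
    A heading counts only when both its text and its level (H1-H6) agree,
    so a heading detected as H3 instead of H2 scores zero for that entry.
    """
    if not truth:
        return 1.0
    pred_levels = {text: level for level, text in pred}
    correct = sum(1 for level, text in truth if pred_levels.get(text) == level)
    return correct / len(truth)
```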

Score        Meaning
0.90+        Excellent — heading hierarchy correct
0.80–0.90    Good — minor level mismatches
0.60–0.80    Fair — some headings missing
< 0.60       Poor — heading detection unreliable

Overall Score

The overall score is the arithmetic mean of NID, TEDS, and MHS:

Overall = (NID + TEDS + MHS) / 3
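The formula above is a plain unweighted mean; `overall` is a hypothetical helper name.

```python
def overall(nid: float, teds: float, mhs: float) -> float:
    # Unweighted arithmetic mean: each metric covers an orthogonal
    # quality dimension, so no single one dominates the result.
    return (nid + teds + mhs) / 3
```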

Each metric is weighted equally because they measure orthogonal quality dimensions:

  • NID → “Did we extract the right text?”
  • TEDS → “Did we get the tables right?”
  • MHS → “Did we detect the document structure?”