# Running Benchmarks
## Prerequisites

- Python 3.10+
- uv (recommended) or pip
- EdgeParse CLI installed
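
To confirm the prerequisites before syncing, a short check can be run from any Python shell. The snippet below is only a convenience sketch; the `edgeparse` executable name used for the CLI check is an assumption, not something the benchmark suite requires.

```python
import shutil
import sys

# Python 3.10+ is required by the benchmark suite.
assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version.split()[0]}"

# uv is the recommended installer; pip works as a fallback.
print("uv found:", bool(shutil.which("uv")))

# Assumption: the EdgeParse CLI is installed as an executable named `edgeparse`.
print("edgeparse CLI found:", bool(shutil.which("edgeparse")))
```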
## Quick Start

```bash
cd benchmark
uv sync  # install dependencies
uv run python run.py --engine edgeparse
```

## Running a Single Tool

```bash
uv run python run.py --engine edgeparse
```

Available tools include `edgeparse`, `docling`, `marker`, `opendataloader`, `pymupdf4llm`, `markitdown`, and hybrid adapters such as `opendataloader_hybrid_docling_fast`.
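
To benchmark several engines back to back, a small wrapper can invoke `run.py` once per engine. This is a hypothetical helper, not part of the benchmark suite: it relies only on the documented `--engine` flag, assumes it is run from the `benchmark/` directory, and assumes `run.py` exits non-zero on failure.

```python
import subprocess
import sys

# Engine names taken from the list above; trim to the ones you need.
ENGINES = ["edgeparse", "docling", "marker", "pymupdf4llm", "markitdown"]

failures = []
for engine in ENGINES:
    print(f"== Benchmarking {engine} ==")
    # Same invocation as the documented single-tool command.
    result = subprocess.run(["uv", "run", "python", "run.py", "--engine", engine])
    if result.returncode != 0:
        failures.append(engine)

if failures:
    print("Engines that failed:", ", ".join(failures), file=sys.stderr)
    sys.exit(1)
```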
## Running Grouped Comparisons

```bash
uv run python compare_all.py --group non-ocr
uv run python compare_all.py --group hybrid
uv run python compare_all.py --group ocr --install
```

This keeps fast local parsers, hybrid backends, and OCR-heavy engines in separate reports. Use `uv run python compare_all.py --all` only when you want a single combined report.
## Custom PDFs

Place your PDF files in `benchmark/pdfs/` and matching ground-truth Markdown in `benchmark/ground-truth/markdown/`.
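
Before running on a custom corpus, it can help to confirm that every PDF has a ground-truth file. The check below is a sketch that assumes the ground-truth Markdown shares the PDF's base name (e.g. `report.pdf` pairs with `report.md`); adjust it if your corpus uses a different naming scheme.

```python
from pathlib import Path

PDF_DIR = Path("benchmark/pdfs")
GT_DIR = Path("benchmark/ground-truth/markdown")

# Assumption: ground truth uses the PDF's base name with a .md extension.
missing = [
    pdf.name
    for pdf in sorted(PDF_DIR.glob("*.pdf"))
    if not (GT_DIR / f"{pdf.stem}.md").exists()
]

if missing:
    print("PDFs without ground truth:", ", ".join(missing))
else:
    print("All PDFs have matching ground-truth Markdown.")
```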
```bash
# Run with custom corpus
uv run python run.py --engine edgeparse --input-dir ./my-pdfs
```

## Viewing Reports
Reports are generated as HTML files:

```bash
open reports/benchmark-latest.html
```
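
`open` is the macOS opener; on Linux, `xdg-open` plays the same role. As a cross-platform alternative, the report can be opened via Python's standard library (the path assumes you are in the `benchmark/` directory):

```python
import webbrowser
from pathlib import Path

# Open the latest report in the default browser.
report = Path("reports/benchmark-latest.html").resolve()
webbrowser.open(report.as_uri())
```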
## Thresholds

The `thresholds.json` file defines minimum acceptable scores:
{ "nid": 0.85, "teds": 0.70, "mhs": 0.75, "overall": 0.80}CI will fail if EdgeParse scores drop below these thresholds.