EdgeParse is a high-performance PDF-to-structured-data extraction engine written in Rust. It converts complex PDFs into clean, structured JSON, Markdown, or HTML in milliseconds without ML dependencies.

How fast is EdgeParse compared to other PDF parsers?

EdgeParse processes 40+ pages per second, 10 to 100× faster than Python-based alternatives such as Docling or Marker. Its average processing time is 0.026 s per document.

What programming languages does EdgeParse support?

EdgeParse provides native bindings for Python (via PyO3) and Node.js (via napi-rs), ships a standalone CLI binary, and can be used directly as a Rust library crate.

Does EdgeParse require GPU or ML models?

No. EdgeParse is a rule-based extraction engine with zero ML dependencies. No GPU, Java, Poppler, or Tesseract is required. Just run `pip install edgeparse` and go.

Architecture Overview

Crate Dependency Graph

edgeparse-cli
  └── edgeparse-core
        └── pdf-cos

edgeparse-python
  └── edgeparse-core (via PyO3)

edgeparse-node
  └── edgeparse-core (via napi-rs)

Crates

| Crate | Purpose |
| --- | --- |
| `pdf-cos` | Low-level COS object parser (xref, streams, filters) |
| `edgeparse-core` | Document model, extraction pipeline, renderers |
| `edgeparse-cli` | CLI binary |
| `edgeparse-python` | Python bindings (PyO3 + maturin) |
| `edgeparse-node` | Node.js bindings (napi-rs) |

Pipeline Stages

EdgeParse processes a PDF through a 20-stage pipeline:

Byte loading — read file or buffer
Header parse — verify %PDF- magic
Xref parse — locate cross-reference table
Trailer parse — extract root, info, encrypt refs
Object resolve — dereference indirect objects
Stream decode — decompress FlateDecode, LZW, etc.
Page tree walk — enumerate pages from /Pages tree
Content stream parse — tokenize PDF operators (Tj, TJ, Tm, etc.)
Font map build — extract /ToUnicode, /Encoding, built-in maps
Glyph decode — map character codes → Unicode
Text run assembly — group characters into runs by font/position
Line detection — cluster runs into lines by baseline
Reading order — sort lines into column-aware reading order
Paragraph merge — join lines into paragraphs by indentation/spacing
Table detection — identify grid structures via ruling lines
Heading detection — classify headings by font-size heuristics
List detection — recognize bullet/numbered lists
Image extraction — decode inline/XObject images
Document model build — assemble Block tree
Rendering — output JSON, Markdown, HTML, or plain text
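
To make one of the layout stages above concrete, here is a minimal sketch of line detection: text runs are clustered into lines when their baselines fall within a small tolerance. The `TextRun` struct, the `cluster_into_lines` function, and the `tolerance` parameter are hypothetical names chosen for illustration. They are not EdgeParse's actual types or API, and the real stage may use a different clustering rule.

```rust
/// Hypothetical text run for illustration; not EdgeParse's actual type.
#[derive(Debug, Clone, PartialEq)]
struct TextRun {
    text: String,
    x: f32,        // left edge, in PDF user-space units
    baseline: f32, // y coordinate of the baseline
}

/// Groups runs whose baseline lies within `tolerance` of the first run
/// placed on the current line, then orders each line left to right.
/// Lines come back top to bottom (PDF y grows upward).
fn cluster_into_lines(mut runs: Vec<TextRun>, tolerance: f32) -> Vec<Vec<TextRun>> {
    // Sort by descending baseline, then by x. `total_cmp` gives a total
    // order on f32, so the result does not depend on input order.
    runs.sort_by(|a, b| {
        b.baseline
            .total_cmp(&a.baseline)
            .then(a.x.total_cmp(&b.x))
    });

    let mut lines: Vec<Vec<TextRun>> = Vec::new();
    for run in runs {
        // Does this run sit on the same baseline as the line being built?
        let joins_last = lines
            .last()
            .map_or(false, |line| (line[0].baseline - run.baseline).abs() <= tolerance);
        if joins_last {
            lines.last_mut().unwrap().push(run);
        } else {
            lines.push(vec![run]);
        }
    }

    // Within each line, restore left-to-right order.
    for line in &mut lines {
        line.sort_by(|a, b| a.x.total_cmp(&b.x));
    }
    lines
}

fn main() {
    let runs = vec![
        TextRun { text: "world".into(), x: 60.0, baseline: 700.2 },
        TextRun { text: "Hello".into(), x: 10.0, baseline: 700.0 },
        TextRun { text: "Next".into(), x: 10.0, baseline: 686.0 },
    ];
    for line in cluster_into_lines(runs, 0.5) {
        let words: Vec<&str> = line.iter().map(|r| r.text.as_str()).collect();
        println!("{}", words.join(" "));
    }
}
```

Sorting with a total order before clustering keeps the result independent of the order in which runs arrive, which is one simple way to get the deterministic behavior a rule-based pipeline aims for.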

Design Principles

Zero ML dependency — all extraction is rule-based
Zero copy where possible — borrowed references into the PDF byte buffer
No unsafe — safe Rust only (except FFI boundaries)
Streaming capable — pages can be processed independently
Deterministic — same input always produces same output
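
As an illustration of the zero-copy principle, the sketch below checks the `%PDF-` magic from the header-parse stage and returns the version as a slice borrowed from the input buffer, so nothing is allocated or copied. The function name `parse_header` is hypothetical and not part of EdgeParse's API. The sketch also assumes the header starts at byte 0, which real-world files do not always honor.

```rust
/// Hypothetical helper, not EdgeParse's API: verifies the `%PDF-` magic and
/// returns the version (e.g. "1.7") as a slice borrowed from `buf`.
fn parse_header(buf: &[u8]) -> Option<&str> {
    // Assumption: the header begins at byte 0.
    let rest = buf.strip_prefix(b"%PDF-")?;

    // The version runs up to the first end-of-line byte (or end of input).
    let end = rest
        .iter()
        .position(|&b| b == b'\r' || b == b'\n')
        .unwrap_or(rest.len());
    let version = std::str::from_utf8(&rest[..end]).ok()?;

    // Accept only the "major.minor" shape with ASCII digits on both sides.
    let (major, minor) = version.split_once('.')?;
    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    (all_digits(major) && all_digits(minor)).then_some(version)
}

fn main() {
    let bytes: &[u8] = b"%PDF-1.7\n%\xE2\xE3\xCF\xD3\n1 0 obj\n";
    let version = parse_header(bytes);
    println!("{:?}", version); // Some("1.7")

    // The returned &str points into `bytes`; no copy was made.
    assert_eq!(version.unwrap().as_ptr(), bytes[5..].as_ptr());

    println!("{:?}", parse_header(b"not a pdf")); // None
}
```

Because the return type borrows from `buf`, the compiler guarantees the version string cannot outlive the byte buffer, which is what makes borrowed references practical in safe Rust.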