Architecture Overview
Crate Dependency Graph
edgeparse-cli
└── edgeparse-core
    └── pdf-cos
edgeparse-python
└── edgeparse-core (via PyO3)
edgeparse-node
└── edgeparse-core (via napi-rs)

Crates
| Crate | Purpose |
|---|---|
| pdf-cos | Low-level COS object parser (xref, streams, filters) |
| edgeparse-core | Document model, extraction pipeline, renderers |
| edgeparse-cli | CLI binary |
| edgeparse-python | Python bindings (PyO3 + maturin) |
| edgeparse-node | Node.js bindings (napi-rs) |
Pipeline Stages
EdgeParse processes a PDF through a 20-stage pipeline:
- Byte loading — read file or buffer
- Header parse — verify the %PDF- magic
- Xref parse — locate cross-reference table
- Trailer parse — extract root, info, encrypt refs
- Object resolve — dereference indirect objects
- Stream decode — decompress FlateDecode, LZW, etc.
- Page tree walk — enumerate pages from the /Pages tree
- Content stream parse — tokenize PDF operators (Tj, TJ, Tm, etc.)
- Font map build — extract /ToUnicode, /Encoding, built-in maps
- Glyph decode — map character codes → Unicode
- Text run assembly — group characters into runs by font/position
- Line detection — cluster runs into lines by baseline
- Reading order — sort lines into column-aware reading order
- Paragraph merge — join lines into paragraphs by indentation/spacing
- Table detection — identify grid structures via ruling lines
- Heading detection — classify headings by font-size heuristics
- List detection — recognize bullet/numbered lists
- Image extraction — decode inline/XObject images
- Document model build — assemble the Block tree
- Rendering — output JSON, Markdown, HTML, or plain text
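
The header and xref stages near the top of the list can be sketched in a few lines of Rust. This is a minimal illustration rather than EdgeParse's actual code: the function names `parse_header` and `find_startxref` are hypothetical, and the sketch assumes that the `%PDF-` magic sits at byte 0 and that the last `startxref` keyword in the buffer is the one to follow.

```rust
/// Header parse (sketch): check that the buffer starts with the `%PDF-`
/// magic and return the version text that follows it, e.g. "1.7".
/// Hypothetical helper, not part of any published EdgeParse API.
fn parse_header(bytes: &[u8]) -> Option<&str> {
    let rest = bytes.strip_prefix(b"%PDF-")?;
    // The version runs up to the first line break (or the end of the buffer).
    let end = rest
        .iter()
        .position(|&b| b == b'\r' || b == b'\n')
        .unwrap_or(rest.len());
    std::str::from_utf8(&rest[..end]).ok()
}

/// Xref parse, first step (sketch): find the last `startxref` keyword and
/// read the decimal byte offset that follows it. That offset points at the
/// cross-reference section.
/// Hypothetical helper, not part of any published EdgeParse API.
fn find_startxref(bytes: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    // `rposition` searches from the end but reports an index from the front.
    let pos = bytes
        .windows(KEYWORD.len())
        .rposition(|w| w == KEYWORD)?;
    let tail = &bytes[pos + KEYWORD.len()..];
    // Skip whitespace after the keyword, then collect the ASCII digits.
    let digits: Vec<u8> = tail
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

Both helpers return `Option` so that a caller can reject a malformed file without panicking. A tolerant parser would probably also accept a header that appears a little after byte 0, which this sketch does not attempt.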
Design Principles
- Zero ML dependency — all extraction is rule-based
- Zero copy where possible — borrowed references into the PDF byte buffer
- No unsafe — safe Rust only (except FFI boundaries)
- Streaming capable — pages can be processed independently
- Deterministic — same input always produces same output
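
The zero-copy and no-unsafe principles can be illustrated with a small tokenizer whose output borrows from the input buffer instead of copying it. This is a sketch under stated assumptions: `Token` and `tokens` are hypothetical names chosen for the example, and splitting on ASCII whitespace is a simplification of real PDF tokenization.

```rust
/// A token that borrows its bytes from the PDF buffer.
/// Hypothetical type for illustration, not EdgeParse's document model.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    /// Byte offset of the token within the source buffer.
    offset: usize,
    /// Borrowed view into the source buffer; no allocation per token.
    bytes: &'a [u8],
}

/// Split a buffer on ASCII whitespace and return borrowed tokens.
/// Safe Rust only: the lifetime `'_` ties every token to `buf`, so the
/// compiler rejects any use of a token after the buffer is dropped.
fn tokens(buf: &[u8]) -> Vec<Token<'_>> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < buf.len() {
        if buf[i].is_ascii_whitespace() {
            i += 1;
            continue;
        }
        let start = i;
        while i < buf.len() && !buf[i].is_ascii_whitespace() {
            i += 1;
        }
        out.push(Token { offset: start, bytes: &buf[start..i] });
    }
    out
}
```

Because the function depends only on its input bytes and keeps no global state, calling it twice on the same buffer yields the same tokens, which is the property the determinism principle asks for.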