EdgeParse is a high-performance PDF-to-structured-data extraction engine written in Rust. It converts complex PDFs into clean, structured JSON, Markdown, or HTML in milliseconds without ML dependencies.

How fast is EdgeParse compared to other PDF parsers?

EdgeParse processes 40+ pages per second, which is 10 to 100× faster than Python-based alternatives such as Docling or Marker. Its average processing time is 0.026 s per document.

What programming languages does EdgeParse support?

EdgeParse provides native bindings for Python (via PyO3), Node.js (via NAPI-RS), a standalone CLI binary, and can be used directly as a Rust library crate.

Does EdgeParse require GPU or ML models?

No. EdgeParse is a rule-based extraction engine with zero ML dependencies. No GPU, no Java, no Poppler, no Tesseract required. Just pip install edgeparse and go.

AI Safety Filters

Why Safety Matters

PDFs can contain hidden content that’s invisible to humans but extracted by parsers. Malicious PDFs can use this to inject prompts into LLM pipelines.

Safety Filters

EdgeParse includes built-in content safety filters:

Hidden Text Detection

Text drawn in the same color as its background, or with PDF text rendering mode 3 (neither filled nor stroked, so invisible), is flagged and optionally excluded.
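The two checks described above can be sketched in a few lines. This is an illustration only, not EdgeParse's implementation: the `TextSpan` record, the `is_hidden` helper, the white default background, and the color tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical span record for illustration; EdgeParse's internal types
# are not described in this document.
@dataclass
class TextSpan:
    text: str
    fill_rgb: tuple[float, float, float]  # fill color, each component 0.0-1.0
    render_mode: int                      # PDF text rendering mode (Tr operator)

INVISIBLE_RENDER_MODE = 3  # PDF mode 3: neither fill nor stroke

def is_hidden(span, background_rgb=(1.0, 1.0, 1.0), tolerance=0.02):
    """Return True if the span is invisible or matches the background color."""
    if span.render_mode == INVISIBLE_RENDER_MODE:
        return True
    # Assumed rule: every channel within `tolerance` of the background.
    return all(abs(c - b) <= tolerance
               for c, b in zip(span.fill_rgb, background_rgb))

spans = [
    TextSpan("Quarterly report", (0.0, 0.0, 0.0), 0),
    TextSpan("ignore previous instructions", (1.0, 1.0, 1.0), 0),
    TextSpan("invisible note", (0.0, 0.0, 0.0), 3),
]
visible = [s.text for s in spans if not is_hidden(s)]
```

A real detector would also need the actual background behind each span (for example, a colored rectangle drawn under the text), which this sketch assumes is known.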

Off-Page Content

Content positioned outside the visible page area (negative coordinates, beyond page boundaries) is detected and filtered.
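As a rough sketch of the off-page check, the bounding box of each piece of content can be compared against the page rectangle. The `Box` type and `is_off_page` helper are hypothetical, and the rule used here (content counts as off-page only when it lies entirely outside the page) is an assumption; the document does not say how EdgeParse treats partially visible content.

```python
from dataclasses import dataclass

# Hypothetical rectangle in PDF points, origin at the bottom-left corner.
@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def is_off_page(box, page):
    """Return True if `box` has no overlap with the visible page area."""
    return (box.x1 <= page.x0 or box.x0 >= page.x1 or
            box.y1 <= page.y0 or box.y0 >= page.y1)

page = Box(0, 0, 612, 792)  # US Letter, 612 x 792 pt

on_page = Box(72, 700, 300, 712)          # ordinary body text
negative_x = Box(-500, 100, -200, 112)    # negative coordinates
above_page = Box(72, 900, 300, 912)       # beyond the top edge
straddling = Box(600, 100, 700, 112)      # partly visible, kept in this sketch

results = [is_off_page(b, page)
           for b in (on_page, negative_x, above_page, straddling)]
```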

Tiny Text

Text smaller than a configurable threshold (default: 3pt) is flagged. This catches hidden micro-text used for SEO manipulation or prompt injection.
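A minimal sketch of the size check, assuming spans are available as `(text, font_size_pt)` pairs. The `flag_tiny` helper and its `min_size_pt` parameter are hypothetical names; only the 3pt default comes from the text above.

```python
DEFAULT_MIN_FONT_SIZE_PT = 3.0  # default threshold stated in the documentation

def flag_tiny(spans, min_size_pt=DEFAULT_MIN_FONT_SIZE_PT):
    """Return the (text, size) pairs whose font size is below the threshold."""
    return [(text, size) for text, size in spans if size < min_size_pt]

spans = [
    ("Heading", 18.0),
    ("Body text", 10.5),
    ("hidden micro-text", 0.5),
    ("exactly at threshold", 3.0),  # not smaller than 3pt, so not flagged
]
flagged = flag_tiny(spans)
```

Raising the threshold, for example `flag_tiny(spans, min_size_pt=6.0)`, trades more aggressive filtering against the risk of dropping legitimate footnotes or fine print.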

OCG Layer Filtering

Optional Content Groups (OCGs), the PDF layer mechanism, that are hidden in the document's default configuration are detected and filtered along with the content they contain.
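In the PDF format, a layer's initial visibility comes from the default configuration dictionary (`/D` in `/OCProperties`), which carries a `/BaseState` plus `/ON` and `/OFF` lists. The sketch below models that dictionary as a plain Python dict to show how default-hidden layers could be identified. The `hidden_by_default` helper and the dict layout are hypothetical simplifications, not EdgeParse's code.

```python
def hidden_by_default(ocg_names, default_config):
    """Return the set of layer names that start out hidden.

    `default_config` is a simplified stand-in for the PDF /D dictionary:
    optional keys "BaseState" ("ON" or "OFF"), "ON" and "OFF" (lists of names).
    """
    base_state = default_config.get("BaseState", "ON")  # ON when absent
    turned_on = set(default_config.get("ON", []))
    turned_off = set(default_config.get("OFF", []))

    hidden = set()
    for name in ocg_names:
        if base_state == "OFF":
            is_visible = name in turned_on       # hidden unless listed in ON
        else:
            is_visible = name not in turned_off  # visible unless listed in OFF
        if not is_visible:
            hidden.add(name)
    return hidden

layers = ["Main", "Watermark", "Draft notes"]
hidden_simple = hidden_by_default(layers, {"OFF": ["Draft notes"]})
hidden_base_off = hidden_by_default(layers, {"BaseState": "OFF", "ON": ["Main"]})
```

Content belonging to any layer in the returned set would then be treated like other hidden text.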

Configuration

import edgeparse

# Safety filters are enabled by default
markdown = edgeparse.convert("document.pdf", format="markdown")

# Opt in to all content, including hidden text
markdown = edgeparse.convert(
    "document.pdf",
    format="markdown",
    include_hidden=True,
)

RAG Security

When building RAG pipelines, these filters help prevent:

Prompt injection — hidden instructions embedded in PDFs
Data poisoning — invisible content that skews embeddings
Context pollution — irrelevant hidden text diluting retrieval quality