EdgeParse is a high-performance PDF-to-structured-data extraction engine written in Rust. It converts complex PDFs into clean, structured JSON, Markdown, or HTML in milliseconds without ML dependencies.

How fast is EdgeParse compared to other PDF parsers?

EdgeParse processes 40+ pages per second — 10 to 100× faster than Python-based alternatives like Docling or Marker. It achieves 0.026s average processing time per document.

What programming languages does EdgeParse support?

EdgeParse provides native bindings for Python (via PyO3), Node.js (via NAPI-RS), a standalone CLI binary, and can be used directly as a Rust library crate.

Does EdgeParse require GPU or ML models?

No. EdgeParse is a rule-based extraction engine with zero ML dependencies. No GPU, no Java, no Poppler, no Tesseract required. Just pip install edgeparse and go.

RAG Integration

Why EdgeParse for RAG?

EdgeParse produces structured output with reading order, heading hierarchy, and table structure — all critical for high-quality retrieval-augmented generation.

LangChain Integration

import edgeparse
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Extract structured Markdown
markdown = edgeparse.convert("document.pdf", format="markdown")

# Split by heading hierarchy
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "H1"),
        ("##", "H2"),
        ("###", "H3"),
    ]
)

chunks = splitter.split_text(markdown)

for chunk in chunks:
    print(chunk.metadata)
    print(chunk.page_content[:100])
    print("---")

LlamaIndex Integration

import edgeparse
from llama_index.core import Document

# Extract as JSON for maximum metadata
json_str = edgeparse.convert("document.pdf", format="json")
import json
data = json.loads(json_str)

# Create documents with rich metadata
documents = []
for element in data["kids"]:
    if element["type"] in ("paragraph", "heading"):
        doc = Document(
            text=element["content"],
            metadata={
                "page": element["page number"],
                "type": element["type"],
                "source": data["file name"],
            }
        )
        documents.append(doc)

Best Practices

Use Markdown format for heading-based chunking
Use JSON format when you need bounding boxes or element metadata
Enable safety filters to prevent prompt injection from malicious PDFs
Preserve table structure — structured tables improve retrieval quality