Open Source. The PDF extraction engine for the AI era.

The PDF Engine for RAG Pipelines

Feed your LLMs clean structured data. EdgeParse extracts headings, tables, lists, and reading order from any PDF — in milliseconds, with zero ML dependencies. Built in Rust.

pip install edgeparse
0 ML dependencies
Works with: Python, Node.js, Rust, CLI

One Command. PDF Superpowers for Your AI Agent.

Install the EdgeParse skill and your agents instantly know how to read any PDF.

# Step 1: Register the EdgeParse agent skill
npx skills add raphaelmansuy/edgeparse --skill edgeparse

# This adds to skills-lock.json:
# {
#   "version": 1,
#   "skills": {
#     "edgeparse": {
#       "source": "raphaelmansuy/edgeparse",
#       "sourceType": "github"
#     }
#   }
# }

# Step 2: Install the Python runtime
pip install edgeparse

# macOS / Linux: one-time setup
brew tap raphaelmansuy/tap
brew install edgeparse

# Verify installation
edgeparse --version

# Parse a PDF to Markdown
edgeparse report.pdf --format markdown

# Parse to JSON with bounding boxes
edgeparse invoice.pdf --format json

# Batch convert a directory
edgeparse docs/*.pdf --format markdown --output-dir results/

import edgeparse, json

# Convert PDF to Markdown
md = edgeparse.convert("report.pdf", format="markdown")
print(md[:500])

# Parse structured JSON with bounding boxes
doc = json.loads(edgeparse.convert("report.pdf", format="json"))
for el in doc["elements"][:3]:
    print(el["type"], el["text"][:60])

# Save to output file
path = edgeparse.convert_file("report.pdf", output_dir="out/", format="markdown")

# Extract specific pages with table clustering
md = edgeparse.convert("report.pdf", pages="1-5", table_method="cluster")

import { convert, convertFile } from "edgeparse";

// Convert PDF to Markdown
const md = convert("report.pdf", { format: "markdown" });
console.log(md.slice(0, 500));

// Parse structured JSON output
const doc = JSON.parse(convert("invoice.pdf", { format: "json" }));
doc.elements.slice(0, 3).forEach(el => console.log(el.type, el.text));

// Extract specific pages
const pages = convert("report.pdf", { format: "markdown", pages: "1-5" });

// Save to output directory
const path = convertFile("report.pdf", { outputDir: "out/", format: "markdown" });
Features

What AI Agents Get with EdgeParse

EdgeParse gives AI agents full PDF comprehension — structured, deterministic, production-ready.

Read Complex Tables

Ruling-line and borderless table detection with cell span merging. Agents get clean table data they can reason about accurately.

Understand Document Structure

Headings, paragraphs, lists, figures — all classified and nested. Agents see the full hierarchy, not a flat blob of text.
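
As a rough illustration of how an agent or RAG pipeline might use that hierarchy, the sketch below groups a flat list of elements into heading-scoped sections. It assumes each element is a dict with "type" and "text" keys, as in the Python example above; the "heading" type value, the sample data, and the helper name group_by_heading are assumptions for illustration, not part of the EdgeParse API.

```python
from typing import Dict, List, Optional


def group_by_heading(elements: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group a flat element list into sections, one per heading.

    Assumes (hypothetically) that headings carry type == "heading".
    Text that appears before the first heading lands in a section
    whose heading is None.
    """
    sections: List[Dict[str, object]] = []
    heading: Optional[str] = None
    body: List[str] = []

    for el in elements:
        if el.get("type") == "heading":
            # Close the previous section before starting a new one.
            if heading is not None or body:
                sections.append({"heading": heading, "body": body})
            heading = el.get("text", "")
            body = []
        else:
            body.append(el.get("text", ""))

    # Flush the last open section, if any.
    if heading is not None or body:
        sections.append({"heading": heading, "body": body})
    return sections


# Hypothetical sample shaped like doc["elements"] from the JSON output.
sample = [
    {"type": "heading", "text": "Introduction"},
    {"type": "paragraph", "text": "EdgeParse extracts structure."},
    {"type": "heading", "text": "Results"},
    {"type": "paragraph", "text": "Tables are preserved."},
    {"type": "list", "text": "First item"},
]

sections = group_by_heading(sample)
for section in sections:
    print(section["heading"], len(section["body"]))
```

Each section can then be embedded or indexed as one chunk, so retrieved text carries its heading as context. If the real JSON nests children under their parent elements, a recursive walk would replace the single loop.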

Correct Reading Order

Multi-column layouts, sidebars, captions — agents read your PDF in the right logical order, every time.

Sub-second Extraction

Process a 200-page PDF in under a second. No GPU warm-up, no model inference, no waiting. 18× faster than Docling.

Zero Setup

No Java, no Tesseract, no GPU, no OCR model downloads. Run pip install edgeparse and you're done.

Deterministic Output

Same PDF always produces the same output. No hallucinations, no random failures — agents get consistent, reliable data.
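
One way to check this property in your own pipeline is to hash the output of repeated conversions and compare the digests. The sketch below is a minimal version of that idea; the helper names output_digest and is_deterministic are hypothetical, and the converter is passed in as a plain callable so the check does not depend on any particular EdgeParse signature.

```python
import hashlib
from typing import Callable


def output_digest(text: str) -> str:
    """Return the SHA-256 hex digest of a converted document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_deterministic(convert: Callable[[str], str], pdf_path: str, runs: int = 3) -> bool:
    """Convert the same file several times and compare output digests.

    `convert` is any function that maps a PDF path to a string, for
    example a small wrapper around the convert call shown earlier.
    """
    digests = {output_digest(convert(pdf_path)) for _ in range(runs)}
    return len(digests) == 1


# Possible usage, assuming edgeparse is installed and report.pdf exists:
#   import edgeparse
#   is_deterministic(lambda p: edgeparse.convert(p, format="markdown"), "report.pdf")
```

Storing the digest next to each indexed document also gives a cheap way to skip re-embedding when a re-parse yields identical output.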

Give Your AI Agents PDF Superpowers

One command. Instant PDF parsing for Claude, LlamaIndex, LangChain, and any AI agent.

npx skills add raphaelmansuy/edgeparse --skill edgeparse