RAG Integration
Why EdgeParse for RAG?
EdgeParse produces structured output with reading order, heading hierarchy, and table structure, all of which are critical for high-quality retrieval-augmented generation.
LangChain Integration
```python
import edgeparse
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Extract structured Markdown
markdown = edgeparse.convert("document.pdf", format="markdown")

# Split by heading hierarchy
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "H1"),
        ("##", "H2"),
        ("###", "H3"),
    ]
)

chunks = splitter.split_text(markdown)

for chunk in chunks:
    print(chunk.metadata)
    print(chunk.page_content[:100])
    print("---")
```

LlamaIndex Integration
```python
import json

import edgeparse
from llama_index.core import Document

# Extract as JSON for maximum metadata
json_str = edgeparse.convert("document.pdf", format="json")
data = json.loads(json_str)

# Create documents with rich metadata
documents = []
for element in data["kids"]:
    if element["type"] in ("paragraph", "heading"):
        doc = Document(
            text=element["content"],
            metadata={
                "page": element["page number"],
                "type": element["type"],
                "source": data["file name"],
            },
        )
        documents.append(doc)
```

Best Practices
- Use Markdown format for heading-based chunking
- Use JSON format when you need bounding boxes or element metadata
- Enable safety filters to reduce the risk of prompt injection from malicious PDFs
- Preserve table structure: structured tables improve retrieval quality
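
As a rough illustration of the last point, the sketch below groups JSON elements under their nearest heading while keeping each table as its own chunk, so a table is never merged into the surrounding text. It runs on hand-written sample data rather than real EdgeParse output. The `type`, `content`, and `page number` keys follow the LlamaIndex example; the `"table"` type and a table carrying its text in `content` are assumptions made for this example, and `chunk_by_heading` is a hypothetical helper, not part of EdgeParse.

```python
from typing import Any, Optional


def chunk_by_heading(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group paragraphs under the most recent heading; emit each table as its own chunk."""
    chunks: list[dict[str, Any]] = []
    heading: Optional[str] = None
    buffer: list[str] = []
    pages: list[int] = []

    def flush() -> None:
        # Emit the buffered paragraphs (if any) as one text chunk.
        if buffer:
            chunks.append({
                "heading": heading,
                "type": "text",
                "text": "\n\n".join(buffer),
                "pages": sorted(set(pages)),
            })
            buffer.clear()
            pages.clear()

    for el in elements:
        kind = el.get("type")
        page = el.get("page number")
        if kind == "heading":
            flush()  # close the previous section before switching headings
            heading = el.get("content")
        elif kind == "table":  # assumed type name, for illustration only
            flush()  # keep the table separate from the surrounding text
            chunks.append({
                "heading": heading,
                "type": "table",
                "text": el.get("content", ""),
                "pages": [page] if page is not None else [],
            })
        elif kind == "paragraph":
            buffer.append(el.get("content", ""))
            if page is not None:
                pages.append(page)
    flush()
    return chunks


# Hand-written sample data standing in for parsed JSON elements.
sample = [
    {"type": "heading", "content": "Results", "page number": 1},
    {"type": "paragraph", "content": "Accuracy improved.", "page number": 1},
    {"type": "table", "content": "| Model | Score |\n| A | 0.9 |", "page number": 2},
    {"type": "paragraph", "content": "See the table above.", "page number": 2},
    {"type": "heading", "content": "Discussion", "page number": 3},
    {"type": "paragraph", "content": "Future work.", "page number": 3},
]

chunks = chunk_by_heading(sample)
for chunk in chunks:
    print(chunk["type"], chunk["heading"], chunk["pages"])
```

Each chunk carries its heading and page numbers, which can be passed along as retrieval metadata. Splitting on tables trades slightly smaller text chunks for table rows that stay intact.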