Tagged PDF Support
What Are Tagged PDFs?
Tagged PDFs include a structure tree that describes the logical organization of content (headings, paragraphs, lists, tables, and figures) with semantic tags similar to HTML.
How EdgeParse Uses Tags
When a PDF contains structure tags, EdgeParse uses them to:
- Validate heading levels: confirm font-based heading detection with H1–H6 tags
- Identify table structure: use <Table>, <TR>, <TD>, and <TH> tags
- Detect lists: recognize <L>, <LI>, and <LBody> tag patterns
- Mark figures: identify <Figure> tags with alt text
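The tag names in this list are standard structure types from the PDF specification. As a minimal sketch of how they could be grouped by role, the snippet below maps a tag name to a coarse category. The `TAG_ROLES` table and the `classify_tag` helper are hypothetical names used for illustration only; they are not part of the EdgeParse API and do not describe its internals.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup from standard PDF structure types to a coarse role.
TAG_ROLES = {
    "Table": "table", "TR": "table", "TD": "table", "TH": "table",
    "L": "list", "LI": "list", "LBody": "list",
    "Figure": "figure",
}

# H1 through H6 carry an explicit heading level in the tag name.
_HEADING = re.compile(r"H([1-6])")


def classify_tag(tag: str) -> Tuple[str, Optional[int]]:
    """Return (role, heading_level) for a structure tag name.

    Accepts names with or without angle brackets, e.g. "<H2>" or "H2".
    heading_level is None for anything that is not H1-H6.
    """
    name = tag.strip("<>/ ")
    match = _HEADING.fullmatch(name)
    if match:
        return "heading", int(match.group(1))
    return TAG_ROLES.get(name, "other"), None
```

For example, `classify_tag("<H2>")` yields the heading role with level 2, while an unlisted tag such as `Span` falls through to `"other"`.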
Fallback Behavior
Most PDFs in the wild are not tagged. EdgeParse’s pipeline does not require tags: it uses font analysis, spatial layout, and alignment patterns as the primary extraction method.
Tagged PDF data is used as supplementary information to improve accuracy when available.
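One way to picture tags as supplementary information is a check that keeps the font-based result and only records whether a tag agrees with it. The sketch below assumes that model; `check_heading` and its status strings are hypothetical, and this page does not say how EdgeParse resolves a disagreement between the two signals.

```python
from typing import Optional, Tuple


def check_heading(font_level: Optional[int], tag: Optional[str]) -> Tuple[Optional[int], str]:
    """Pair a font-based heading level with a status derived from an optional tag.

    The font-based level is always returned unchanged (it is the primary
    signal in this sketch); the tag only contributes a status string.
    """
    tag_level = None
    if tag is not None and len(tag) == 2 and tag[0] == "H" and tag[1] in "123456":
        tag_level = int(tag[1])

    if tag_level is None:
        # No usable heading tag (the common case): layout analysis stands alone.
        return font_level, "untagged"
    if tag_level == font_level:
        return font_level, "confirmed"
    # Tag and font analysis disagree; the sketch only flags it.
    return font_level, "conflict"
```

Calling `check_heading(2, None)` reports the untagged case, and `check_heading(2, "H2")` reports a confirmed level.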
Detection
import edgeparse

result = edgeparse.convert("document.pdf", format="json")
# Tagged PDF information is automatically incorporated
# into the extraction when available
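The call above picks up tags automatically and does not show whether a given file is tagged. If you want a rough check of your own, the sketch below scans the raw bytes for the two catalog entries that tagged PDFs normally carry, `/StructTreeRoot` and `/Marked true`. It uses only the standard library and is independent of EdgeParse; `looks_tagged` is a hypothetical helper. Treat it as a heuristic: it can miss files whose catalog is stored in a compressed object stream, and it can misfire if those byte sequences appear elsewhere in the file.

```python
import re

# "/Marked true" inside the MarkInfo dictionary signals a tagged PDF.
_MARKED = re.compile(rb"/Marked\s+true")


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristically report whether raw PDF bytes appear to be tagged.

    Requires both a structure tree reference and a Marked flag.
    Only uncompressed catalog entries are visible to this scan.
    """
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = _MARKED.search(pdf_bytes) is not None
    return has_struct_tree and is_marked
```

To try it on a file, read the file in binary mode and pass the bytes to `looks_tagged`. A PDF library that parses the document catalog properly gives a more reliable answer.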