Table Extraction

EdgeParse uses two complementary methods for table detection:

When a PDF contains ruling lines (drawn with the re, l, and m path operators), EdgeParse identifies grid patterns and maps text runs into the resulting cells.
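As a rough illustration of the border-based idea, the sketch below turns ruling-line coordinates into grid boundaries and drops text runs into cells. It is not EdgeParse's implementation: the function names (`build_grid`, `locate_cell`, `assign_text`), the tolerance value, and the top-down coordinate convention (y grows downward) are all assumptions made for this example.

```python
from bisect import bisect_right


def build_grid(h_lines, v_lines, tol=1.0):
    """Collapse near-duplicate ruling coordinates into sorted grid edges.

    h_lines: y positions of horizontal rulings (hypothetical input).
    v_lines: x positions of vertical rulings (hypothetical input).
    """
    def dedupe(values):
        edges = []
        for v in sorted(values):
            # Keep a coordinate only if it is clearly separate from the last one.
            if not edges or v - edges[-1] > tol:
                edges.append(v)
        return edges

    return dedupe(h_lines), dedupe(v_lines)


def locate_cell(x, y, row_edges, col_edges):
    """Return (row, col) for a point, or None if it lies outside the grid."""
    r = bisect_right(row_edges, y) - 1
    c = bisect_right(col_edges, x) - 1
    if 0 <= r < len(row_edges) - 1 and 0 <= c < len(col_edges) - 1:
        return r, c
    return None


def assign_text(runs, row_edges, col_edges):
    """Map (x, y, text) runs into cells keyed by (row, col)."""
    cells = {}
    for x, y, text in runs:
        pos = locate_cell(x, y, row_edges, col_edges)
        if pos is not None:
            cells.setdefault(pos, []).append(text)
    return {pos: " ".join(parts) for pos, parts in cells.items()}


row_edges, col_edges = build_grid([0, 20, 20.4, 40], [0, 50, 100])
runs = [(10, 10, "Quarter"), (60, 10, "Revenue"), (10, 30, "Q1"), (60, 30, "100")]
grid_cells = assign_text(runs, row_edges, col_edges)
```

A real detector would also have to check that the rulings actually intersect and form closed cells; this sketch assumes a complete rectangular grid.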

For borderless tables, EdgeParse analyzes text alignment patterns to infer column boundaries and row groupings.
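One simple way to picture alignment-based inference is to cluster the left edges of text runs into columns and their baselines into rows. The sketch below does exactly that. It is an assumption-laden illustration rather than EdgeParse's algorithm: `cluster_positions`, `infer_table`, and the gap thresholds are hypothetical, and y is assumed to grow downward.

```python
def cluster_positions(values, gap):
    """Group 1-D positions; a new cluster starts when the jump exceeds `gap`.

    Returns the mean position of each cluster, in ascending order.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def nearest_index(value, centers):
    """Index of the cluster center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def infer_table(runs, x_gap=5.0, y_gap=3.0):
    """Build a row-major table from (x_left, y_baseline, text) runs."""
    col_centers = cluster_positions([r[0] for r in runs], x_gap)
    row_centers = cluster_positions([r[1] for r in runs], y_gap)
    table = [["" for _ in col_centers] for _ in row_centers]
    for x, y, text in runs:
        r = nearest_index(y, row_centers)
        c = nearest_index(x, col_centers)
        # Runs that land in the same cell are joined with a space.
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


runs = [
    (72.0, 100.0, "Quarter"), (200.5, 100.2, "Revenue"),
    (72.3, 115.0, "Q1"), (201.0, 115.1, "100"),
]
inferred = infer_table(runs)
```

Left-edge clustering only handles left-aligned columns; right-aligned or centered columns would need the right edge or midpoint instead.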

After initial detection, EdgeParse identifies spanning cells by analyzing:

  • Cells with content that overlaps multiple grid positions
  • Header rows with merged cells
  • Row/column span attributes from tagged PDF structure
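The first signal in the list, content overlapping multiple grid positions, can be sketched as an interval-overlap count. The helper names (`span_of`, `cell_span`), the output keys, and the tolerance are assumptions for illustration only; spans read from tagged PDF structure are not covered here.

```python
def span_of(lo, hi, edges, tol=1.0):
    """Return (start_index, span) of the grid intervals that [lo, hi] overlaps.

    An interval counts as covered only if the overlap exceeds `tol`,
    so content that barely touches a neighbour is not treated as spanning.
    """
    covered = [
        i for i in range(len(edges) - 1)
        if min(hi, edges[i + 1]) - max(lo, edges[i]) > tol
    ]
    if not covered:
        return None
    return covered[0], covered[-1] - covered[0] + 1


def cell_span(bbox, row_edges, col_edges):
    """Describe the grid position and spans of a content bounding box.

    bbox is (x0, y0, x1, y1) with y growing downward (an assumption).
    """
    x0, y0, x1, y1 = bbox
    col = span_of(x0, x1, col_edges)
    row = span_of(y0, y1, row_edges)
    if col is None or row is None:
        return None
    return {"row": row[0], "col": col[0], "row_span": row[1], "col_span": col[1]}


col_edges = [0, 50, 100, 150]
row_edges = [0, 20, 40]
merged_header = cell_span((5, 2, 95, 18), row_edges, col_edges)
plain_cell = cell_span((55, 22, 90, 38), row_edges, col_edges)
```

Here `merged_header` covers the first two columns of the first row, which is the typical shape of a merged header cell.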

EdgeParse achieves a TEDS score of 0.783, the highest among the rule-based tools compared:

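| Tool        | TEDS Score | Type        |
|-------------|------------|-------------|
| Docling     | 0.887      | ML-based    |
| Marker      | 0.825      | ML-based    |
| EdgeParse   | 0.783      | Rule-based  |
| EdgeQuake   | 0.795      | ML-enhanced |
| PyMuPDF4LLM | 0.540      | Rule-based  |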
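For readers unfamiliar with the metric: TEDS (Tree-Edit-Distance-based Similarity) compares a predicted table with the ground truth as trees of rows and cells. Assuming the scores above use the standard definition, it is computed as follows, where $T_a$ and $T_b$ are the two table trees, $\mathrm{EditDist}$ is the tree edit distance, and $|T|$ is the number of nodes in $T$:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A score of 1 means the predicted structure and content match the reference exactly; lower scores mean more edits are needed to turn one tree into the other.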

Tables in JSON output include full cell structure:

{
  "type": "table",
  "id": 3,
  "page number": 1,
  "rows": [
    {
      "row number": 0,
      "cells": [
        {"col": 0, "content": "Quarter", "is header": true},
        {"col": 1, "content": "Revenue", "is header": true}
      ]
    }
  ]
}
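A table object of this shape can be flattened into a plain list of rows with a few lines of Python. The sketch below assumes the key names exactly as they appear in the sample above (including the spaces in "row number" and "is header"); the helper names `table_to_rows` and `header_labels` are made up for this example and are not part of EdgeParse.

```python
import json


def table_to_rows(table):
    """Return cell contents as a list of rows, ordered by row and column index."""
    rows = []
    for row in sorted(table["rows"], key=lambda r: r["row number"]):
        cells = sorted(row["cells"], key=lambda c: c["col"])
        rows.append([cell["content"] for cell in cells])
    return rows


def header_labels(table):
    """Collect the content of every cell flagged as a header."""
    return [
        cell["content"]
        for row in table["rows"]
        for cell in row["cells"]
        if cell.get("is header")
    ]


# The sample object from the documentation, parsed from a JSON string.
sample = json.loads("""
{
  "type": "table",
  "id": 3,
  "page number": 1,
  "rows": [
    {
      "row number": 0,
      "cells": [
        {"col": 0, "content": "Quarter", "is header": true},
        {"col": 1, "content": "Revenue", "is header": true}
      ]
    }
  ]
}
""")

rows = table_to_rows(sample)
headers = header_labels(sample)
```

The sample shows no span fields, so this sketch ignores spanning cells; if the real output carries span information, a merged cell would need to be expanded or padded before the rows line up as a rectangular grid.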