Image Extraction

EdgeParse can detect and extract images embedded in PDF documents, reporting their bounding boxes and page locations.

Each detected image appears in the JSON output as an element with type "image":

{
  "type": "image",
  "id": 5,
  "page number": 1,
  "bounding box": [72, 300, 540, 500]
}

Command line:
# Extract with image metadata
edgeparse document.pdf -f json
# Images are reported in the kids array with type "image"

Python:
import json

import edgeparse

# Convert the PDF to a JSON string, then parse it into a dict
json_str = edgeparse.convert("document.pdf", format="json")
data = json.loads(json_str)

# Top-level elements are listed under "kids"
for element in data["kids"]:
    if element["type"] == "image":
        print(f"Image on page {element['page number']}")
        print(f"  Bounding box: {element['bounding box']}")
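
The loop above only looks at top-level elements. As a rough sketch, the helpers below collect image elements at any depth, group them by page, and derive a width and height from each bounding box. Two things here are assumptions rather than documented EdgeParse behavior: that nested elements also live under a "kids" key, and that the bounding box is ordered [x0, y0, x1, y1]. The helper names (collect_images, box_size, images_by_page) and the "paragraph" entry in the sample are illustrative only. The sample dict stands in for the parsed output of edgeparse.convert so the snippet runs without a PDF.

```python
from collections import defaultdict


def collect_images(element):
    """Yield every element of type "image" under `element`, at any depth."""
    if element.get("type") == "image":
        yield element
    # Assumption: nested elements, if any, are also stored under "kids".
    for child in element.get("kids", []):
        yield from collect_images(child)


def box_size(bbox):
    """Return (width, height), assuming bbox is [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = bbox
    return abs(x1 - x0), abs(y1 - y0)


def images_by_page(data):
    """Group image elements by their "page number" value."""
    pages = defaultdict(list)
    for image in collect_images(data):
        pages[image["page number"]].append(image)
    return dict(pages)


# Stand-in for json.loads(edgeparse.convert("document.pdf", format="json")).
# The image entry mirrors the JSON example; the paragraph entry is hypothetical.
sample = {
    "kids": [
        {"type": "paragraph", "id": 4, "page number": 1},
        {"type": "image", "id": 5, "page number": 1,
         "bounding box": [72, 300, 540, 500]},
    ]
}

for page, images in sorted(images_by_page(sample).items()):
    for image in images:
        width, height = box_size(image["bounding box"])
        print(f"Page {page}: image {image['id']} is {width} x {height}")
# prints "Page 1: image 5 is 468 x 200"
```

If the coordinate order turns out to differ, only box_size needs to change; the traversal and grouping do not depend on it.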

In Markdown format, images are represented as placeholders with their page and position information.