extract vector charts from PDF

Question

I can't find a way to isolate or extract vector charts and graphs (ones that are not images) from a PDF.

I tried extracting them directly, but realized it is not that straightforward. I am using PyMuPDF. The script below extracts and saves only images, but I need to save the charts that are not images. Apparently these are stored differently in a PDF.

import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0

# Only the first page is processed here.
page = doc.load_page(0)
# get_images() lists embedded raster images only; vector drawings are not included.
img_list = page.get_images(full=True)
for img in img_list:
    xref = img[0]  # cross-reference number of the image object
    base_image = doc.extract_image(xref)
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    chart_count += 1
    image_path = os.path.join(output_folder, f"chart_{chart_count}.png")
    image.save(image_path)

This works well only for images embedded in the PDF, not for vector charts. Do you have any suggestions or solutions?

Sample PDF file (where you can see that not all charts are extracted)
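
One direction that may be worth trying, offered as a minimal sketch rather than a tested solution: PyMuPDF exposes the vector paths of a page through page.get_drawings(), and each path carries a bounding "rect". The sketch below assumes that paths lying close together belong to the same chart, merges their rectangles into clusters, and renders each cluster to a PNG with page.get_pixmap(clip=...). The helper names merge_rects and save_vector_charts, and the gap, min_size, pad, and zoom parameters, are hypothetical and would need tuning per document. Table rules, page borders, and other line art will be picked up as well, and axis labels are text rather than drawings, which is why the clip is padded.

```python
import os


def merge_rects(rects, gap=5.0):
    """Greedily merge (x0, y0, x1, y1) rectangles that overlap or lie within `gap` points."""
    merged = []
    for r in rects:
        r = tuple(r)
        absorbed = True
        while absorbed:
            absorbed = False
            for m in merged:
                near = (r[0] <= m[2] + gap and m[0] <= r[2] + gap and
                        r[1] <= m[3] + gap and m[1] <= r[3] + gap)
                if near:
                    # Absorb the neighbour, then re-check the grown rectangle.
                    merged.remove(m)
                    r = (min(r[0], m[0]), min(r[1], m[1]),
                         max(r[2], m[2]), max(r[3], m[3]))
                    absorbed = True
                    break
        merged.append(r)
    return merged


def save_vector_charts(pdf_path, output_folder, page_number=0,
                       gap=5.0, min_size=50.0, pad=15.0, zoom=2.0):
    """Render each cluster of vector paths on one page to a PNG (assumption-laden sketch)."""
    import fitz  # PyMuPDF, imported here so merge_rects stays dependency-free

    os.makedirs(output_folder, exist_ok=True)
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    bounds = page.rect

    # Each drawing dict describes one vector path; "rect" is its bounding box.
    rects = [tuple(d["rect"]) for d in page.get_drawings()]

    saved = []
    for r in merge_rects(rects, gap=gap):
        # Skip small clusters such as bullets or underline strokes.
        if r[2] - r[0] < min_size or r[3] - r[1] < min_size:
            continue
        # Pad the region to catch nearby axis labels, clamped to the page.
        clip = fitz.Rect(max(r[0] - pad, bounds.x0), max(r[1] - pad, bounds.y0),
                         min(r[2] + pad, bounds.x1), min(r[3] + pad, bounds.y1))
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
        out_path = os.path.join(output_folder, f"vector_chart_{len(saved) + 1}.png")
        pix.save(out_path)
        saved.append(out_path)
    doc.close()
    return saved
```

Calling save_vector_charts('path to pdf', 'your output folder') would write one PNG per detected cluster on the first page and return the list of saved paths. Rendering the clipped region rasterizes everything inside it, including text and any embedded images, so the output is a picture of the chart rather than the original vector data.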

Answers (1)
