ravshanovbek

Reputation: 11

extract vector charts from PDF

I can't find a way to isolate or extract vector charts and graphs (ones that are not images) from a PDF.

I tried extracting them directly, but realized it is not that straightforward. I was using PyMuPDF. The script below extracts and saves only images, but I need to save charts that are not images. Apparently they are stored differently in a PDF.

import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0

# Only the first page is processed here
page = doc.load_page(0)
# get_images() lists embedded raster images only; vector drawings are not included
img_list = page.get_images(full=True)
for img_index, img in enumerate(img_list):
    # img[0] is the xref of the embedded image
    base_image = doc.extract_image(img[0])
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    image_path = os.path.join(output_folder,
        f"chart_{chart_count+1}.png")
    image.save(image_path)
    chart_count += 1

This works well only for raster images in the PDF, not for vector charts. Do you have any suggestions or solutions?

Sample PDF file (where you can see that not all charts are extracted)

Upvotes: 0

Views: 31

Answers (1)

K J

Reputation: 11857

You have correctly observed that a PDF page is made up of different components. Some are areas of colour, others are text and perhaps JPEG images, so when we strip the background paper colours, the first 6 pages match that description well.

The chart-like pages contain floating images and floating text characters. Any page colours or linework are completely separate sub-page objects.

Moving on to the ones you hoped would behave differently: these are either images or simply parts of the page, so they are not independent graphics that can be extracted.

Thus, to extract objects from an area, either gather them by their coordinates within your Region of Interest (ROI) or redact everything else from the page.
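
A minimal sketch of gathering by coordinates, assuming PyMuPDF's page.get_drawings() reports a bounding "rect" for each vector path: nearby path boxes are merged into candidate chart regions, and each region is then rendered to a PNG. The helper names (merge_boxes, find_vector_regions, render_regions) and the gap, min_size and max_area_fraction thresholds are illustrative assumptions, not part of PyMuPDF, and will need tuning for each document.

```python
import os


def boxes_touch(a, b, gap):
    """True if boxes a and b (x0, y0, x1, y1) overlap or lie within `gap` points."""
    return not (a[2] + gap < b[0] or b[2] + gap < a[0]
                or a[3] + gap < b[1] or b[3] + gap < a[1])


def merge_boxes(boxes, gap=5.0):
    """Repeatedly union boxes that touch until no further merge is possible."""
    merged = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            for i, other in enumerate(result):
                if boxes_touch(box, other, gap):
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged


def find_vector_regions(pdf_path, page_number=0, gap=5.0,
                        min_size=40.0, max_area_fraction=0.8):
    """Return candidate chart regions built from the page's vector drawings."""
    import fitz  # PyMuPDF; imported here so merge_boxes works without it

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        page_area = page.rect.width * page.rect.height
        boxes = []
        for drawing in page.get_drawings():
            rect = drawing["rect"]
            # Skip page-sized background fills, which would swallow every region
            if rect.width * rect.height < max_area_fraction * page_area:
                boxes.append(tuple(rect))
    regions = merge_boxes(boxes, gap)
    # Drop tiny clusters such as rules, bullets and underlines
    return [r for r in regions
            if r[2] - r[0] >= min_size and r[3] - r[1] >= min_size]


def render_regions(pdf_path, regions, output_folder, page_number=0, zoom=2.0):
    """Render each region of the page to a PNG at `zoom` times 72 dpi."""
    import fitz

    os.makedirs(output_folder, exist_ok=True)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        for i, region in enumerate(regions, start=1):
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom),
                                  clip=fitz.Rect(region))
            pix.save(os.path.join(output_folder, f"chart_{i}.png"))
```

Axis labels and legends are text, not drawings, so they do not influence the detected boxes; rendering the clip still includes any text that falls inside it, but labels sitting just outside a region may be cut off unless the region is padded.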

PyMuPDF is good at redaction, so trim everything on the page outside the region of interest using redaction boxes along X and Y.
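
A rough sketch of that trimming step, assuming the ROI is already known in page coordinates: the area outside the ROI is split into up to four strips, and each strip becomes one redaction annotation. outside_strips and keep_only_region are hypothetical helper names; add_redact_annot and apply_redactions are the PyMuPDF calls. How much of the underlying linework and imagery is actually deleted (rather than just covered) depends on the PyMuPDF version and its apply_redactions options, so check the result on your own file.

```python
def outside_strips(page_box, roi):
    """Split the part of page_box outside roi into non-empty rectangles.

    Both arguments are (x0, y0, x1, y1) tuples in page coordinates.
    """
    px0, py0, px1, py1 = page_box
    x0, y0, x1, y1 = roi
    strips = [
        (px0, py0, px1, y0),  # full-width strip above the ROI
        (px0, y1, px1, py1),  # full-width strip below the ROI
        (px0, y0, x0, y1),    # strip to the left of the ROI
        (x1, y0, px1, y1),    # strip to the right of the ROI
    ]
    # Discard strips with zero or negative width/height
    return [s for s in strips if s[2] > s[0] and s[3] > s[1]]


def keep_only_region(pdf_path, out_path, roi, page_number=0):
    """Redact everything outside roi on one page and save the result."""
    import fitz  # PyMuPDF; imported here so outside_strips works without it

    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    for strip in outside_strips(tuple(page.rect), roi):
        page.add_redact_annot(fitz.Rect(strip))
    # Removes content covered by the redaction boxes
    page.apply_redactions()
    doc.save(out_path)
    doc.close()
```

Text whose bounding box straddles the ROI border overlaps a redaction box and is removed too, so it can help to pad the ROI by a few points.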

Then, once all the surrounding data is deleted, make sure the remaining text is one colour for ease of viewing.

The end result of editing with MuPDF can thus be a single-page PDF of the retained and edited area.

Finally, you can reduce the page size to whatever you want it to be.
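
One possible way to do the last two steps in PyMuPDF, as a sketch under assumptions: doc.select() keeps only the chosen page and page.set_cropbox() shrinks the visible page to the ROI. clamp_to_page and crop_to_region are hypothetical helpers, and the code assumes an unrotated page whose MediaBox starts at the origin.

```python
def clamp_to_page(roi, page_box):
    """Clip roi to page_box; both are (x0, y0, x1, y1) tuples."""
    x0 = max(roi[0], page_box[0])
    y0 = max(roi[1], page_box[1])
    x1 = min(roi[2], page_box[2])
    y1 = min(roi[3], page_box[3])
    if x1 <= x0 or y1 <= y0:
        raise ValueError("ROI lies entirely outside the page")
    return (x0, y0, x1, y1)


def crop_to_region(pdf_path, out_path, roi, page_number=0):
    """Save a single-page PDF whose page size is reduced to roi."""
    import fitz  # PyMuPDF; imported here so clamp_to_page works without it

    doc = fitz.open(pdf_path)
    doc.select([page_number])  # keep only the page of interest
    page = doc.load_page(0)
    # The crop box must lie inside the page, so clamp the ROI first
    page.set_cropbox(fitz.Rect(clamp_to_page(roi, tuple(page.rect))))
    doc.save(out_path, garbage=3, deflate=True)
    doc.close()
```

Setting the crop box only hides what lies outside it; the hidden content remains in the file unless it was redacted beforehand.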

The code would be too large for me to write a custom editor for each page, so I simply cut and paste using mutool and Notepad, which is far easier.

Upvotes: 0
