Reputation: 347
I'm using the PyMuPDF library (Documentation), which offers various functions for working with PDF documents in Python.
What I want to achieve: I would like to extract all the images and their labels (e.g. "Figure 1: XYZ") from a PDF document. I cannot use the typical extraction methods because the images are not raster images; they are vector graphics with machine-readable text, so I would like to render the PDF page with just the image on it.
Where I am now: I am able to narrow the document down to the pages that contain images, convert each of those PDF pages into an image, and name the file after its label.
I'm hoping there is a way to remove all text from the page, so that I could save the image file with just the image (and some white space, which should be fine).
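To make the idea concrete, here is a minimal sketch of one possible approach: strip the text objects (everything between the `BT` and `ET` operators) from the page's content stream before rendering it. The helper names `strip_text_objects`, `remove_page_text` and `render_without_text` are hypothetical, and the sketch assumes PyMuPDF is importable as `fitz`. It is deliberately naive: it would also drop any text drawn inside the figure itself, it does not touch text stored in Form XObjects, and a string literal that happens to contain a standalone `ET` could end a match early.

```python
import re

# In a PDF content stream, a text object starts with the BT operator and
# ends with ET. Operators are whitespace-delimited tokens, hence the
# lookarounds that require whitespace (or stream start/end) on both sides.
_TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?:\s.*?)?\sET(?!\S)", re.DOTALL)


def strip_text_objects(stream: bytes) -> bytes:
    """Return the content stream with every BT ... ET text object removed."""
    return _TEXT_OBJECT.sub(b"", stream)


def remove_page_text(page) -> None:
    """Hypothetical helper: rewrite a PyMuPDF page's contents without text."""
    page.clean_contents()          # concatenate and normalize content streams
    xrefs = page.get_contents()    # xrefs of the page's content streams
    if not xrefs:                  # page without any content: nothing to do
        return
    doc = page.parent
    cleaned = strip_text_objects(doc.xref_stream(xrefs[0]))
    doc.update_stream(xrefs[0], cleaned)


def render_without_text(pdf_path: str, page_number: int, out_path: str) -> None:
    """Hypothetical usage: save one page as an image after removing its text."""
    import fitz  # PyMuPDF; assumed to be installed

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    remove_page_text(page)
    page = doc.reload_page(page)   # make sure the modified stream is used
    page.get_pixmap(dpi=150).save(out_path)
    doc.close()
```

Since the figure label is removed along with the rest of the text, the label would have to be read (for example to build the file name) before calling `remove_page_text`.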
Upvotes: 0
Views: 499
Reputation: 61
I am not familiar with Python, but this is something that can easily be done using UniPDF. It has built-in code for many functions, and you can customize that code based on your needs. See the examples at https://github.com/unidoc/unipdf-examples.
I am confident this will help you a lot.
Upvotes: 0