Reputation: 1639
Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document using Python?
Upvotes: 0
Views: 2984
Reputation: 1195
You might want to take a look at pdfimages from the poppler-utils package.
I used a sample PDF for this (Sample PDF).
Running the following command extracts the images present in the PDF:
pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image
[Images: some of the images extracted from this brochure]
So you can use Python's subprocess module to run this command, extract all the images, and then count the files it produced.
Note: This method has some drawbacks. By default it writes images in PPM (or PBM) format rather than JPEG; the -j option of pdfimages saves JPEG-encoded images as .jpg files instead. It may also extract additional images, such as masks, that do not appear as separate images in the PDF.
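The subprocess approach described above could look roughly like the sketch below. It assumes the pdfimages binary from poppler-utils is on your PATH; the function name count_extracted_images and the "image" file prefix are just illustrative choices. It extracts into a temporary directory and counts the files, so it inherits the drawbacks mentioned in the note.

```python
import os
import subprocess
import tempfile


def count_extracted_images(pdf_path):
    """Extract images with pdfimages into a temp dir and count the files.

    Assumes the `pdfimages` command (poppler-utils) is installed.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # pdfimages writes files named <prefix>-000.ppm, <prefix>-001.ppm, ...
        prefix = os.path.join(tmp_dir, "image")
        subprocess.run(["pdfimages", pdf_path, prefix], check=True)
        # Every file in the temp dir is one extracted image (or mask).
        return len(os.listdir(tmp_dir))
```

You would call it as count_extracted_images("/home/tata/Desktop/4555c-5055cBrochure.pdf"). The count can be higher than the number of pictures you see on the pages, for the reasons given in the note.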
If you want to do this using pdfminer, take a look at this blog post - Extracting Text & Images from PDF Files
Pdfminer lets you traverse the layout of a particular PDF page. The Pdfminer docs include a diagram of the layout objects and the tree structure that pdfminer generates:
[Figure: Layout Objects and Tree Structure. Image source: Pdfminer Docs]
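Thus, walking the layout tree and looking inside LTFigure objects (images themselves appear as LTImage objects, typically nested inside an LTFigure) can help you extract or count the images in the PDF document.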
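A minimal sketch of that traversal is shown below. It assumes the pdfminer.six fork, which provides extract_pages in pdfminer.high_level; the helper names walk_layout and count_images are hypothetical. It walks every page's layout tree and counts the LTImage objects it finds.

```python
def walk_layout(element):
    """Yield element and, recursively, every layout object nested inside it."""
    yield element
    # Containers such as LTPage and LTFigure are iterable; leaves such as
    # LTImage are not, so the recursion stops there.
    if hasattr(element, "__iter__"):
        for child in element:
            yield from walk_layout(child)


def count_images(pdf_path):
    """Count LTImage objects across all pages of the PDF."""
    # Imported here so walk_layout can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return sum(
        isinstance(obj, LTImage)
        for page in extract_pages(pdf_path)
        for obj in walk_layout(page)
    )
```

Calling count_images("some.pdf") returns the number of image objects pdfminer found. Inline images and vector drawings may or may not show up as LTImage, so treat the result as an estimate.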
Note: Neither of these methods is guaranteed to be accurate; how well they work depends heavily on the type of PDF document you are dealing with.
Upvotes: 1
Reputation: 2260
I don't think this can be done directly, although I have done something similar using the following approach
Upvotes: 0