Hayat

Reputation: 1639

Count images in a PDF document with Python

Is there a way to count the number of images (JPEG, PNG, JPG) in a PDF document using Python?

Upvotes: 0

Views: 2984

Answers (2)

Ganesh Tata

Reputation: 1195

  1. Using pdfimages from poppler-utils

You might want to take a look at pdfimages from the poppler-utils package.

I used this sample PDF: Sample PDF

Running the following command extracts the images present in the PDF:

pdfimages /home/tata/Desktop/4555c-5055cBrochure.pdf image

Some of the images extracted from this brochure are -

Extracted Image1

Extracted Image 2

So, you can use Python's subprocess module to run this command and extract all the images.

Note: This method has some drawbacks. By default, pdfimages writes images in PPM (or PBM) format rather than JPEG; with the -j option, images that are stored as JPEG in the PDF are saved as .jpg files. Also, some of the extracted files may not correspond to what you would consider images in the PDF.
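If you only need a count, a minimal sketch could use the -list option of pdfimages, which prints one row per image instead of writing files. This assumes poppler-utils is installed and pdfimages is on your PATH; the function names are made up for illustration. Every listed row is counted, so masks that pdfimages reports as separate rows are included.

```python
import subprocess


def count_image_rows(listing: str) -> int:
    """Count the data rows in the text printed by `pdfimages -list`."""
    count = 0
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; the column header and the
        # dashed separator line do not.
        if fields and fields[0].isdigit():
            count += 1
    return count


def count_pdf_images(pdf_path: str) -> int:
    """Run `pdfimages -list` on pdf_path and count the rows it reports."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfimages fails
    )
    return count_image_rows(result.stdout)
```

Calling count_pdf_images("/home/tata/Desktop/4555c-5055cBrochure.pdf") would then return the number of rows that pdfimages lists for the brochure.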

  2. Using pdfminer

If you want to do this using pdfminer, take a look at this blog post - Extracting Text & Images from PDF Files

Pdfminer allows you to traverse through the layout of a particular pdf page. The following image shows the layout objects as well as the tree structure generated by pdfminer -

Layout Objects and Tree Structure

Image Source - Pdfminer Docs

Thus, walking the layout tree and collecting the image objects (pdfminer represents embedded images as LTImage, usually nested inside LTFigure containers) can help you extract or count the images in the PDF document.

Note: Neither of these methods is guaranteed to be accurate; the results depend heavily on the type of PDF document you are dealing with.
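A rough sketch of that traversal, assuming the pdfminer.six package (which provides extract_pages and the LTImage layout class). The helper names are hypothetical, and the count depends on how pdfminer interprets each page.

```python
def count_instances(element, target_type) -> int:
    """Recursively count layout objects of target_type under element."""
    if isinstance(element, target_type):
        return 1
    try:
        children = iter(element)  # containers such as pages and figures
    except TypeError:
        return 0  # leaf objects (characters, lines, ...) have no children
    return sum(count_instances(child, target_type) for child in children)


def count_images_with_pdfminer(pdf_path: str) -> int:
    """Count LTImage objects on every page of the PDF at pdf_path."""
    # Imported here so the helper above can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage

    return sum(count_instances(page, LTImage) for page in extract_pages(pdf_path))
```

The recursion matters because an image placed through a form object sits one or more levels below the page, inside an LTFigure.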

Upvotes: 1

Amarpreet Singh

Reputation: 2260

I don't think this can be done directly, although I have done something similar using the following approach:

  1. Use Ghostscript to convert the PDF into page images.
  2. On each page image, use computer vision (OpenCV) to extract the areas of interest (in your case, images).
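The first step could be sketched as below, assuming the Ghostscript executable is available as gs on your PATH. The function names, output file pattern, and resolution are illustrative choices, not part of the approach described above. The OpenCV step is left out because the detection logic depends on what the pages look like.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, output_pattern, dpi=150):
    """Build a Ghostscript command that renders each page to a PNG file."""
    return [
        "gs",
        "-dNOPAUSE",         # do not pause between pages
        "-dBATCH",           # exit when the input has been processed
        "-sDEVICE=png16m",   # 24-bit colour PNG output
        f"-r{dpi}",          # rendering resolution in dots per inch
        f"-sOutputFile={output_pattern}",  # %03d becomes the page number
        str(pdf_path),
    ]


def render_pages(pdf_path, out_dir, dpi=150):
    """Render every page of pdf_path into out_dir and return the PNG paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pattern = str(out_dir / "page-%03d.png")
    subprocess.run(ghostscript_command(pdf_path, pattern, dpi), check=True)
    return sorted(out_dir.glob("page-*.png"))
```

Each returned path can then be loaded with OpenCV for the second step.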

Upvotes: 0
