Maverick

Reputation: 799

Count number of images in PDF with Python

I'm trying to count the number of images in a PDF using Python and write the results to a CSV file. Ideally, the CSV would have a column for the file name and one column per page holding the number of images on that page. A column for the file name and a column for the total number of images in the document would also suffice.

I have tried:

import fitz
import io
from PIL import Image
import csv

with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer 
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers 
    propertyWriter.writerow(['file', 'results', 'error'])
    for file in pdfs:

        # open the file
        pdf_file = fitz.open(file)


        # printing number of images found in this page
        if image_list:
            results = len(image_list[0])
            error = ""
            #print(results)
            #results = str(f"+ Found a total of {len(image_list)} images in page {page_index}")

        else:
            error = str("! No images found on page", page_index)
        propertyWriter.writerow([file, results, error])

Reference: https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/ However, this approach reports 9 images for every PDF, which isn't the case.

I have then tried:

import fitz
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer 
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers 
    propertyWriter.writerow(['file', 'results'])
    for file in pdfs[0:5]:
        for i in range(len(doc)):
            for img in doc.getPageImageList(i):
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                results = str(pix)

    propertyWriter.writerow([file, results])

Reference: "Extract images from PDF without resampling, in python?" But again, this reports the same number of images for every PDF, which is not the case.

Upvotes: 0

Views: 1405

Answers (1)

barbwire

Reputation: 218

I tried the first reference you mentioned (https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/) and the code on that page works perfectly. What is wrong with it? It counts the images on each page of the PDF, so you just have to sum the per-page counts for each PDF.

If you put this inside your for loop over the files, you should be able to reach your goal:

import fitz  # PyMuPDF

file = "doctest.pdf"
pdf_file = fitz.open(file)
results = 0

for page_index in range(len(pdf_file)):
    # get_images() is the current name for getImageList() in newer PyMuPDF releases
    image_list = pdf_file[page_index].get_images()

    # add the number of images found on this page to the running total
    results += len(image_list)

print("Total images in this PDF:", results)
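
For the per-page CSV layout you describe, here is a minimal sketch of how the counting and the writing could be split up. The helper names `count_images_per_page` and `write_counts`, the `page_N` column headers, and the demo data are my own assumptions for illustration, not something from the linked article. It also assumes a PyMuPDF version that provides `get_images()`.

```python
import csv
import io


def count_images_per_page(path):
    """Return a list with one image count per page of the PDF at `path`."""
    import fitz  # PyMuPDF, imported here so the CSV helper works without it

    with fitz.open(path) as doc:
        # len(page.get_images()) counts the image entries on the page;
        # len(image_list[0]) would instead measure one entry's tuple of fields
        return [len(page.get_images()) for page in doc]


def write_counts(counts_by_file, csvfile):
    """Write one row per file: name, total, then one column per page."""
    max_pages = max((len(c) for c in counts_by_file.values()), default=0)
    writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    writer.writerow(["file", "total"] + [f"page_{i + 1}" for i in range(max_pages)])
    for name, counts in counts_by_file.items():
        # pad shorter documents so every row has the same number of columns
        padding = [""] * (max_pages - len(counts))
        writer.writerow([name, sum(counts)] + counts + padding)


# Made-up counts standing in for {f: count_images_per_page(f) for f in pdfs}
demo = {"a.pdf": [2, 0, 1], "b.pdf": [4]}
buffer = io.StringIO()
write_counts(demo, buffer)
print(buffer.getvalue())
```

With real files you would build the dictionary from your `pdfs` list and pass the open `csvfile` instead of the `StringIO` buffer. As for the constant 9 in your first attempt: `len(image_list[0])` measures the length of a single image entry rather than the number of entries, which would explain the same result for every PDF if each entry is a tuple of nine fields in your PyMuPDF version (I have not checked which version you are running).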

Upvotes: 1
