Reputation: 799
I'm trying to count the number of images within a PDF using Python and write the results to a CSV file. Ideally, I would like to produce a CSV with a column for the file name and a column for each page showing the number of images on that page. But a column showing the file name and the total number of images in the document would suffice.
I have tried:
import fitz
import io
from PIL import Image
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers
    propertyWriter.writerow(['file', 'results', 'error'])
    for file in pdfs:
        # open the file
        pdf_file = fitz.open(file)
        # printing number of images found in this page
        if image_list:
            results = len(image_list[0])
            error = ""
            #print(results)
            #results = str(f"+ Found a total of {len(image_list)} images in page {page_index}")
        else:
            error = str("! No images found on page", page_index)
        propertyWriter.writerow([file, results, error])
Reference: https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/ However, this option reports that there are 9 images in each PDF, which isn't the case.
I have then tried:
import fitz
import csv
with open(r'output.csv', 'x', newline='', encoding='utf-8') as csvfile:
    # Declaring the writer
    propertyWriter = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
    # Writing the headers
    propertyWriter.writerow(['file', 'results'])
    for file in pdfs[0:5]:
        for i in range(len(doc)):
            for img in doc.getPageImageList(i):
                xref = img[0]
                pix = fitz.Pixmap(doc, xref)
                results = str(pix)
                propertyWriter.writerow([file, results])
Reference: "Extract images from PDF without resampling, in python?" But again, this reports the same number of images in each PDF, which is not the case.
Upvotes: 0
Views: 1405
Reputation: 218
I tried the first reference you mentioned (https://www.geeksforgeeks.org/how-to-extract-images-from-pdf-in-python/) and the code on that page works perfectly. What is wrong with it? It counts the images on each page of the PDF, so you just have to sum the per-page counts for each PDF.
If you put this inside your for loop over the files, you should be able to reach your goal:
import fitz
import io
from PIL import Image
file = "doctest.pdf"
pdf_file = fitz.open(file)
results = 0
for page_index in range(len(pdf_file)):
    # newer PyMuPDF releases name this method get_images()
    image_list = pdf_file[page_index].getImageList()
    # adding the number of images found on this page to the total
    if image_list:
        results += len(image_list)
print("Total images in this PDF: ", results)
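As a rough sketch of the per-page CSV layout the question asks for, the counting loop above could be combined with `csv.writer` along these lines. The helper names `count_images_per_page` and `write_image_counts`, the `counter` parameter, and the `total` / `page_N` column names are made up for illustration, and the sketch assumes a PyMuPDF release that provides `Page.get_images()`. It also points at a likely cause of the constant 9 in the question: `len(image_list[0])` measures the length of one image entry (a tuple of image properties), whereas `len(image_list)` is the number of images on the page.

```python
import csv


def count_images_per_page(pdf_path):
    """Return a list holding the number of image entries on each page."""
    # Imported here so the CSV helper below can be used without PyMuPDF.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        # len(page.get_images()) counts the entries in the list,
        # not the fields inside a single entry.
        return [len(page.get_images()) for page in doc]


def write_image_counts(pdf_paths, out_path, counter=count_images_per_page):
    """Write one CSV row per PDF: file, total, then one column per page."""
    results = {path: counter(path) for path in pdf_paths}
    # PDFs differ in length, so size the header for the longest one.
    max_pages = max((len(counts) for counts in results.values()), default=0)

    with open(out_path, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile, quoting=csv.QUOTE_ALL)
        header = ["file", "total"] + [f"page_{i + 1}" for i in range(max_pages)]
        writer.writerow(header)
        for path, counts in results.items():
            # Pad shorter documents with empty cells to keep rows aligned.
            padding = [""] * (max_pages - len(counts))
            writer.writerow([path, sum(counts)] + counts + padding)
```

Calling `write_image_counts(pdfs, "output.csv")` with the question's `pdfs` list would then produce one row per file. Note that this counts image entries listed for each page, so an image reused on several pages is counted once on each of them.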
Upvotes: 1