Reputation: 117
I am currently working on a Python 3.x image extractor for PDF files and can't seem to find a solution to a problem I have been facing throughout my work. My intention is to extract all the images from PDF files (vehicle reports) without the logos of the company that provides these papers. So far I have working code using fitz that finds the images and stores them (I found this code on the internet). Unfortunately, they are returned in the wrong order. To annotate the pictures with their headings, they have to be saved in the order in which they appear in the PDF.
I already tried to get this right by using the object names defined in the xref string (the string defining an object in the PDF) in ascending order. Before that version, I annotated the pictures with a counter through a dict (which I know is unsorted, but I fixed that by sorting the keys), but about 2-4 of approximately 30 images still ended up out of order. Additionally, this code doesn't seem to be a good solution to me because I 'fake' the image number by annotating a counter.
My current version (xref Name):
import fitz
import sys
import re
checkXO = r"/Type(?= */XObject)" # finds "/Type/XObject"
checkIM = r"/Subtype(?= */Image)" # finds "/Subtype/Image"
doc = fitz.open(fr"{pdfpath}")
lenXREF = doc._getXrefLength() # number of objects
pixmaps = {}
imgcount=0
count=0
imglist=[]
for i in range(1, lenXREF): # scan through all objects
    text = doc._getXrefString(i) # string defining the object
    isXObject = re.search(checkXO, text) # tests for XObject
    isImage = re.search(checkIM, text) # tests for Image
    if not isXObject or not isImage: # not an image object if not both True
        continue
    count+=1
    pix = fitz.Pixmap(doc, i) # make pixmap from image
    if re.search(r'Name \WIm(\d+)',text) != None:
        imglist.append(re.search(r'Name \W(Im\d+)',text).group(1))
        pixmaps[re.search(r'Name \W(Im\d+)',text).group(1)]=pix
    if re.search(r'Name \W(Im\d+)',text) == None:
        imglist.append(count)
        pixmaps[count]=pix
imglist1=[]
for i in range(1,doc.pageCount):
    if len(doc.getPageImageList(i))>1:
        for entry in doc.getPageImageList(i):
            imglist1.append(entry[7])
        break
for entry in imglist1:
    pixmaps[entry].writeImage(fr"{dirpath}\%s.jpg" % (imgcount),'jpg')
    imgcount+=1
Feel free to also suggest a completely new way to work on this task. Thanks in advance for your help.
Upvotes: 2
Views: 4183
Reputation: 1
Use sorted() on the image list. If you can use a different version, refer to https://stackoverflow.com/a/68267356/7240889
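One caveat with plain sorted(): names like "Im10" and "Im2" are compared as strings, so "Im10" sorts before "Im2". A minimal sketch of a numeric-aware sort key follows; `natural_key` is a hypothetical helper name, and it assumes the names follow the Im<number> pattern from the question. Note that this only orders by name and does not by itself guarantee the order on the page.

```python
import re


def natural_key(name):
    """Split a name such as 'Im10' into text and integer parts so the
    numeric part compares as a number rather than as a string."""
    parts = re.split(r"(\d+)", str(name))
    return [int(part) if part.isdigit() else part for part in parts]


names = ["Im10", "Im2", "Im1"]
plain = sorted(names)                     # string comparison
natural = sorted(names, key=natural_key)  # numeric-aware comparison
print(plain)    # ['Im1', 'Im10', 'Im2']
print(natural)  # ['Im1', 'Im2', 'Im10']
```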
Upvotes: 0
Reputation: 3110
Answer from repo maintainer:
In the newer PyMuPDF versions (best use v1.17.0) you can get an image's position on the page. This seems to be your intention when you talk of "right order": rect = page.getImageBbox(name), where name is your entry[7] above.
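To make this concrete, here is a rough sketch of how the bounding boxes could be used to sort images into reading order (by page, then top to bottom, then left to right). The helper names `reading_order` and `collect_image_boxes` are hypothetical. The PyMuPDF calls are the ones named in this thread (`pageCount`, `getPageImageList`, `getImageBbox`) and assume a version around v1.17; other releases may spell these names differently. It also assumes that the y coordinate grows downward on the page. The sample data at the end is made up so the sorting can be tried without a PDF.

```python
def reading_order(items):
    """Sort (page_number, name, (x0, y0, x1, y1)) tuples by page,
    then by the top edge (y0), then by the left edge (x0)."""
    return sorted(items, key=lambda it: (it[0], it[2][1], it[2][0]))


def collect_image_boxes(doc):
    """Hypothetical helper: gather (page_number, name, bbox) for every
    image of an already opened fitz document.

    Assumes the method names used in this thread (PyMuPDF ~v1.17).
    """
    items = []
    for pno in range(doc.pageCount):
        page = doc[pno]
        for entry in doc.getPageImageList(pno):
            name = entry[7]                 # image name, as in the question
            rect = page.getImageBbox(name)  # position of the image on the page
            items.append((pno, name, (rect.x0, rect.y0, rect.x1, rect.y1)))
    return items


# Made-up sample data standing in for collect_image_boxes(doc)
sample = [
    (1, "Im3", (50, 300, 150, 400)),
    (0, "Im2", (300, 100, 400, 200)),
    (0, "Im1", (50, 100, 150, 200)),
    (0, "Im0", (50, 20, 150, 80)),
]
ordered_names = [name for _, name, _ in reading_order(sample)]
print(ordered_names)  # ['Im0', 'Im1', 'Im2', 'Im3']
```

With a real document, the idea would be to call reading_order(collect_image_boxes(doc)) and save the pixmaps in that order. Images whose top edges differ by only a fraction of a point may need y0 rounded before sorting.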
Upvotes: 3