Reputation: 67
Using PyMuPDF, I want to extract all images from a PDF, save them separately, replace each image in the PDF with its file name at the same position, and save the result as another document. I can save all the images with the following code:
import fitz

# This creates the Document object doc
doc = fitz.open("Article_Example_1_2.pdf")
html_text = ""
for i in range(len(doc)):
    print(doc[i]._getContents())
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # this is GRAY or RGB or pix.n < 5
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
doc.save(filename=r"new.pdf")
doc.close()
However, I am not sure how to replace the images in the PDF with their stored file names. I would greatly appreciate it if anyone could help me out here.
Upvotes: 0
Views: 2330
Reputation: 3120
Message from the repo maintainer:
I am not sure whether we have discussed this in the repo's issues. What you can do is use the new "redaction annotation" feature. Basic approach:
1. Get each image's bounding box with Page.getImageBbox().
2. Add a redaction annotation over that box with Page.addRedactAnnot(bbox, text=filename, ...).
3. Once all images on the page are covered, call Page.apply_redactions().
This will remove all images and all redactions. The chosen filename will appear in the former image bbox. Make sure to use PyMuPDF v1.17.0 or later.
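The three steps above could be combined with the extraction loop from the question roughly as follows. This is only a sketch, not tested against the asker's file. It assumes a newer PyMuPDF release where the methods carry snake_case names (get_images, get_image_bbox, add_redact_annot, Pixmap.save); on v1.17.x the camelCase names from this answer apply instead. The helper names image_filename and replace_images_with_names are made up for illustration.

```python
def image_filename(page_index, xref):
    """Build the PNG name for an image, using the question's naming pattern."""
    return "p%s-%s.png" % (page_index, xref)


def replace_images_with_names(src_path, dst_path):
    """Save every image as a PNG, then put its file name where the image was."""
    # PyMuPDF is imported here so image_filename() is usable without it.
    import fitz

    doc = fitz.open(src_path)
    for page_index, page in enumerate(doc):
        for item in page.get_images(full=True):
            xref = item[0]
            name = image_filename(page_index, xref)

            # Step 0: save the image, as in the question.
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB first
                pix = fitz.Pixmap(fitz.csRGB, pix)
            pix.save(name)

            # Step 1: locate the image on the page.
            bbox = page.get_image_bbox(item)
            if bbox.is_empty or bbox.is_infinite:
                continue  # image is referenced but not actually shown here

            # Step 2: mark the area, with the file name as replacement text.
            page.add_redact_annot(bbox, text=name)

        # Step 3: remove the marked content and write the replacement text.
        page.apply_redactions()

    doc.save(dst_path)
    doc.close()
```

With the file from the question, the call would be replace_images_with_names("Article_Example_1_2.pdf", "new.pdf"). Redactions are applied once per page, after all of that page's images have been marked, so each image is still present when its bounding box is looked up.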
Upvotes: 2