Mohammad Ahmed

Reputation: 67

Replacing images with their file names in a PDF using PyMuPDF

Using PyMuPDF, I want to extract all images from a PDF, save them as separate files, replace each image in the PDF with its file name at the same position, and save the result as another document. I can already save all the images with the following code:

import fitz  # PyMuPDF

# Open the source PDF as a Document object.
doc = fitz.open("Article_Example_1_2.pdf")
for i in range(len(doc)):
    # Each entry in the page's image list starts with the image's xref.
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:  # GRAY or RGB: can be written as PNG directly
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None  # release the pixmap

doc.save(filename=r"new.pdf")

doc.close()

However, I am not sure how to replace each image in the PDF with the name of its saved file. I would greatly appreciate any help.

Upvotes: 0

Views: 2330

Answers (1)

Jorj McKie

Reputation: 3120

Message from the repo maintainer:

I am not sure whether we have already discussed this in the repo's issues. What you can do is use the new "redaction annotation" feature. The basic approach:

  1. Calculate the bbox of each image via Page.getImageBbox().
  2. Add a redaction annotation via Page.addRedactAnnot(bbox, text=filename, ...).
  3. When finished with the page, execute Page.apply_redactions(). This will remove all images and all redactions. The chosen filename will appear in the former image bbox.
  4. Save as a new document.

Make sure to use PyMuPDF v1.17.0 or later.
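
The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the maintainer's own code: it assumes a newer PyMuPDF release in which the methods use snake_case names (get_images, get_image_bbox, add_redact_annot), whereas the names in the list above are the older camelCase spellings. The helper name replace_images_with_names is hypothetical, and the file name pattern is copied from the question so the inserted text matches the saved PNG files.

```python
def replace_images_with_names(doc):
    """Redact every displayed image and put its file name in its place.

    `doc` is assumed to be an open PyMuPDF document (e.g. from fitz.open()).
    Returns the list of file names written into the pages.
    """
    names = []
    for page_number, page in enumerate(doc):
        for item in page.get_images(full=True):
            xref = item[0]
            # Same naming pattern as the extraction loop in the question.
            filename = "p%s-%s.png" % (page_number, xref)
            # Step 1: where the image is shown on the page.
            bbox = page.get_image_bbox(item)
            # Skip images that yield no usable rectangle.
            if bbox.is_empty or bbox.is_infinite:
                continue
            # Step 2: a redaction annotation carrying the file name as text.
            page.add_redact_annot(bbox, text=filename)
            names.append(filename)
        # Step 3: apply once per page, after all annotations are added.
        page.apply_redactions()
    return names
```

One way to use it would be to call replace_images_with_names(doc) after the extraction loop from the question and before doc.save("new.pdf"), so the images are written to disk before they are redacted. The call to apply_redactions() uses its default arguments here; how overlapping images are treated may depend on the installed version, so the output is worth checking visually.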

Upvotes: 2
