Reputation: 1637
I have seen a number of solutions on the web and here for extracting images from a PDF with PyMuPDF, PyPDF2, and others, but none of them successfully retains transparency information: they either use deprecated code that no longer works, or the questions have gone unanswered. The examples I try show a black background where the transparency should be. If I open the PDF in Photoshop and pull out the image, it has a transparent background, as I would expect, so I know the information is in there somewhere. Does anyone have an example that does this with Python?
Here is an example of a post with solutions that do extract images, but they all either convert to the wrong file format or save as PNG without the transparency.
Extract images from PDF without resampling, in python?
Upvotes: 1
Views: 1422
Reputation: 1
This code worked well, retaining transparency:
import os
import fitz  # PyMuPDF

def extract_images(pdf_path, output_dir):
    # Make sure the output directory exists before saving into it
    os.makedirs(output_dir, exist_ok=True)
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    # Iterate through each page in the PDF
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        # Get a list of images on the page
        image_list = page.get_images(full=True)
        for image_index, img in enumerate(image_list):
            # Each entry starts with (xref, smask xref, width, height)
            xref, smask, w, h = img[:4]
            # Check for the presence of a soft mask
            if smask:
                mask = fitz.Pixmap(pdf_document, smask)
                # Rescale the mask if its size differs from the image
                if (mask.width != w) or (mask.height != h):
                    mask = fitz.Pixmap(mask, w, h, None)
                image = fitz.Pixmap(pdf_document, xref)
                # Attach the mask as the alpha channel, then convert to RGB
                image = fitz.Pixmap(image, mask)
                image = fitz.Pixmap(fitz.csRGB, image)
            else:
                image = fitz.Pixmap(pdf_document, xref)
            # Save the image
            image.save(os.path.join(output_dir, f"image_page{page_num + 1}_{image_index + 1}.png"), "PNG")
    pdf_document.close()

extract_images("path/to/your/pdf_file", "path/to/your/output_dir")
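To check whether a saved PNG really kept its transparency, one option is to read the color type from the file's IHDR chunk. The sketch below is only a sanity check under stated assumptions: the helper name `png_has_alpha_channel` is made up for illustration, and it only detects a full alpha channel (color types 4 and 6), not palette PNGs that carry transparency in a separate tRNS chunk.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_has_alpha_channel(data: bytes) -> bool:
    """Return True if the PNG header declares an alpha channel.

    Hypothetical helper: looks only at the IHDR color type
    (4 = grey + alpha, 6 = RGB + alpha).
    """
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # IHDR is always the first chunk: 4-byte length, 4-byte type, 13 data bytes.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("malformed PNG: IHDR chunk missing")
    # Width and height are read only to reach the two bytes that follow them.
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return color_type in (4, 6)
```

For a file written by the code above, something like `png_has_alpha_channel(open("image_page1_1.png", "rb").read())` should come back `True` when the soft mask was applied.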
Upvotes: 0
Reputation: 11730
PDF images are not what you seem to expect. Let's take one sample, but bear in mind that every image can be inserted differently (otherwise there would be no need for different extraction apps). PDF was not designed to be split apart retrospectively: many objects were simplified for putting toner or ink on what is usually white paper, so transparency was an afterthought that had to wait for the fourth version (%PDF-1.4 and later).
Here is a file containing nothing other than what appears to be a single transparent image. Note it says it's object 19, but there is nothing else!
Let's query what is in the file with a Poppler utility, since many Python libraries depend on that or Ghostscript.
Poppler\poppler-22.04.0\Library\bin>pdfimages -list tt.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     950   575  gray    1   8  image  no        27  0    72    72 42.4K 7.9%
   1     1 image     950   575  index   1   8  image  no        19  0    72    72 2454B 0.4%
   1     2 smask     950   575  gray    1   8  image  no        19  0    72    72 42.4K 7.9%
What we can now see is that there are 3 entries and 2 of them refer to object 19, so there are two images, 27 and 19, and 19 also has an overlay (a soft mask).
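If you want to do the same pairing in Python, a rough sketch is to parse the text that `pdfimages -list` prints and group the rows by the "object" column. The function name `group_by_object` is my own, and the column positions are assumed from the listing shown here, so other Poppler versions may need adjusting.

```python
from collections import defaultdict

SAMPLE_LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     950   575  gray    1   8  image  no        27  0    72    72 42.4K 7.9%
   1     1 image     950   575  index   1   8  image  no        19  0    72    72 2454B 0.4%
   1     2 smask     950   575  gray    1   8  image  no        19  0    72    72 42.4K 7.9%
"""


def group_by_object(listing: str) -> dict:
    """Map each PDF object number to the entry types listed for it."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a numeric page number; the header and rule do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        entry_type = fields[2]            # "image", "smask", ...
        object_number = int(fields[10])   # "object" column; fields[11] is the ID after it
        groups[object_number].append(entry_type)
    return dict(groups)


if __name__ == "__main__":
    for obj, kinds in group_by_object(SAMPLE_LISTING).items():
        print(obj, kinds)
```

Any object that lists both `image` and `smask` is a candidate for recombining into a single transparent PNG.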
This is because PDF cannot store an RGBA image as one object; it has to be split into an RGB image and a greyscale image for the transparency. So here are the 2 images viewed in a graphics viewer. Many libraries need some image library to blend them back into one PNG.
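To make the blending step concrete, here is a minimal standard-library sketch that interleaves already-decoded 8-bit RGB samples with an 8-bit grey mask and writes the result as an RGBA PNG. Both function names are invented for illustration, and it assumes the image has already been expanded to plain RGB (the indexed image in this sample would need its palette applied first) and that the mask has the same pixel dimensions.

```python
import struct
import zlib


def merge_rgb_and_mask(rgb: bytes, mask: bytes) -> bytes:
    """Interleave packed RGB samples with one mask byte per pixel into RGBA."""
    if len(rgb) != 3 * len(mask):
        raise ValueError("mask must have exactly one sample per RGB pixel")
    rgba = bytearray(4 * len(mask))
    rgba[0::4] = rgb[0::3]   # red
    rgba[1::4] = rgb[1::3]   # green
    rgba[2::4] = rgb[2::3]   # blue
    rgba[3::4] = mask        # soft mask becomes the alpha channel
    return bytes(rgba)


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def encode_rgba_png(rgba: bytes, width: int, height: int) -> bytes:
    """Encode raw 8-bit RGBA samples as an unfiltered, non-interlaced PNG."""
    if len(rgba) != width * height * 4:
        raise ValueError("pixel data does not match width x height")
    stride = width * 4
    # Every scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + rgba[y * stride:(y + 1) * stride] for y in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 6, 0, 0, 0)  # 8-bit, color type 6
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )
```

In practice an image library does this for you; the point is only that the alpha channel comes from the separate mask object, not from the image object itself.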
An alternative library is MuPDF
MuPDF\1.20.0-tesseract>mutool extract -a tt.pdf
extracting image-0019.png
extracting image-0020.png
extracting image-0027.png
The simplest CLI extractor is thus MuPDF, so PyMuPDF should be able to export a similar blend. (However, since I use single-line programs without the overhead Python imposes, I can't confirm the code needed in this specific case.)
As a side note, using a different image extractor that should maintain transparency gave a different view of the contents: it extracted the transparent image plus a smaller thumbnail. Looking at the producer (Adobe Illustrator) explains much of the differing graphics issues.
pdfcpu_0.3.12>pdfcpu images list tt.pdf
pages: all
2 images available
page  obj#  id   type   width  height  colorspace  comp  bpc  interp  size    filters
=======================================================================================
   1    19  Im0  smask    950     575  Indexed        3    8          2 KB    FlateDecode
        13       thumb    105      57  Indexed        3    8          171     ASCII85Decode,FlateDecode
<xmp:Thumbnails>
<rdf:Alt>
<rdf:li rdf:parseType="Resource">
<xmpGImg:width>256</xmpGImg:width>
<xmpGImg:height>156</xmpGImg:height>
<xmpGImg:format>JPEG</xmpGImg:format>
...
<photoshop:LayerText>Testing</photoshop:LayerText>
...
<rdf:li>Cyan</rdf:li>
<rdf:li>Magenta</rdf:li>
<rdf:li>Yellow</rdf:li>
<rdf:li>Black</rdf:li>
...
<</BC 23 0 R/G 24 0 R/S/Luminosity/Type/Mask>>
In fact, there were several references to masks!
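A quick way to look for such references yourself is to scan the raw file bytes for the relevant keys. This is only a rough sketch: `count_mask_references` and the pattern names are my own, and it will miss anything stored inside compressed object streams, so a zero count does not prove there are no masks.

```python
import re

# Byte patterns for the mask-related keys seen in the dictionary above.
MASK_PATTERNS = {
    "smask_key": re.compile(rb"/SMask\b"),
    "mask_dict": re.compile(rb"/Type\s*/Mask\b"),
    "luminosity": re.compile(rb"/S\s*/Luminosity\b"),
}


def count_mask_references(pdf_bytes: bytes) -> dict:
    """Count occurrences of each mask-related key in uncompressed PDF bytes."""
    return {
        name: len(pattern.findall(pdf_bytes))
        for name, pattern in MASK_PATTERNS.items()
    }


if __name__ == "__main__":
    sample = b"<</BC 23 0 R/G 24 0 R/S/Luminosity/Type/Mask>>"
    print(count_mask_references(sample))
```

To run it on a real file, pass in the bytes read with `open("tt.pdf", "rb").read()`.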
Upvotes: 4