Reputation: 1637

Extract all Images from PDF with Python, and retain their transparency

I see a number of solutions on the web and here for extracting images from a PDF with PyMuPDF, PyPDF2, and others, but none of them works for me: they either fail to retain the transparency information or use deprecated code that no longer runs, and some of the questions have gone unanswered. The examples I try show a black background where the transparency should be. If I open the PDF in Photoshop and pull out the image, it has a transparent background, as I would expect, so I know the information is in there somewhere. Does anyone have an example that does this with Python?

Here is an example of a post with solutions that do extract images, but they all either convert to the wrong file format or save as PNG without the transparency:

Extract images from PDF without resampling, in python?

Upvotes: 1

Answers (2)

voidspaces

Reputation: 1

This code worked well, retaining transparency:

import os
import fitz  # PyMuPDF

def extract_images(pdf_path, output_dir):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Iterate through each page in the PDF
    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]

        # Get a list of images on the page
        image_list = page.get_images(full=True)

        for image_index, img in enumerate(image_list):
            xref, smask, w, h = [img[n] for n in (0, 1, 2, 3)]

            # Check for the presence of a soft mask (smask is 0 if there is none)
            if smask:
                mask = fitz.Pixmap(pdf_document, smask)
                # Scale the mask if its size differs from the image's
                if (mask.width != w) or (mask.height != h):
                    mask = fitz.Pixmap(mask, w, h, None)
                image = fitz.Pixmap(pdf_document, xref)
                image = fitz.Pixmap(image, mask)
                image = fitz.Pixmap(fitz.csRGB, image)
            else:
                image = fitz.Pixmap(pdf_document, xref)

            # Save the image
            image.save(f"{output_dir}/image_page{page_num + 1}_{image_index + 1}.png", "PNG")

    pdf_document.close()

extract_images("path/to/your/pdf_file", "path/to/your/output_dir")

Upvotes: 0

K J

Reputation: 11730

PDF images are not what you seem to expect. Let's take one sample, bearing in mind that every image can be inserted differently (otherwise there would be no need for different extraction apps). PDF was not designed to be split apart retrospectively. Many objects were simplified for putting toner or ink onto (usually white) paper, so transparency was an afterthought that only arrived with the fourth version (%PDF-1.4 and later).

Here is a file with nothing other than what appears to be a single transparent image. Note that it says it is object 19, but there is nothing else!

Let's query what is in the file with a Poppler utility, since most Python libraries depend on that or Ghostscript.

Poppler\poppler-22.04.0\Library\bin>pdfimages -list tt.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     950   575  gray    1   8  image  no        27  0    72    72 42.4K 7.9%
   1     1 image     950   575  index   1   8  image  no        19  0    72    72 2454B 0.4%
   1     2 smask     950   575  gray    1   8  image  no        19  0    72    72 42.4K 7.9%

We can now see that there are 3 entries and that 2 of them refer to object 19. So there are two image objects, 27 and 19, and 19 also has an overlay (a soft mask).
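To make that reading of the listing explicit, here is a minimal sketch that groups the rows of the pdfimages -list output above by object number. The helper name group_by_object and the assumption that the object number is always the 11th whitespace-separated column are mine, based only on the listing shown here; other Poppler versions may lay the columns out differently.

```python
from collections import defaultdict

# The listing printed above, pasted in as text.
LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     950   575  gray    1   8  image  no        27  0    72    72 42.4K 7.9%
   1     1 image     950   575  index   1   8  image  no        19  0    72    72 2454B 0.4%
   1     2 smask     950   575  gray    1   8  image  no        19  0    72    72 42.4K 7.9%
"""

def group_by_object(listing):
    """Map each object number to the list of entry types that refer to it."""
    groups = defaultdict(list)
    # Skip the header line and the dashed separator line.
    for line in listing.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 11:
            continue  # not a data row
        entry_type = fields[2]        # "image" or "smask"
        object_number = int(fields[10])
        groups[object_number].append(entry_type)
    return dict(groups)

objects = group_by_object(LISTING)
print(objects)
```

Any object number that collects an smask entry alongside an image entry is one whose transparency lives in a separate mask.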

This is because PDF cannot store an RGBA image as a single object; it has to be split into an RGB image and a greyscale image for the transparency. Here are the 2 images viewed in a graphics viewer. Many libraries then need an image library to blend them back into one PNG.
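As a rough illustration of what "blending them back" means, the sketch below interleaves raw 8-bit RGB samples with an 8-bit greyscale mask to form RGBA bytes. It is pure Python and does not read a PDF. The function name merge_rgb_and_mask is hypothetical, and it assumes both buffers are already decoded, the same size, and in RGB (the indexed image in this sample would first need its palette expanded). Real libraries do this step for you, for example the fitz.Pixmap(image, mask) call in the other answer.

```python
def merge_rgb_and_mask(rgb, mask, width, height):
    """Interleave 8-bit RGB samples with an 8-bit grey mask into RGBA bytes."""
    pixels = width * height
    if len(rgb) != pixels * 3 or len(mask) != pixels:
        raise ValueError("RGB and mask buffers do not match the image size")
    rgba = bytearray(pixels * 4)
    rgba[0::4] = rgb[0::3]  # red channel
    rgba[1::4] = rgb[1::3]  # green channel
    rgba[2::4] = rgb[2::3]  # blue channel
    rgba[3::4] = mask       # the mask value becomes the alpha channel
    return bytes(rgba)

# A 2x1 image: an opaque red pixel followed by a fully transparent blue pixel.
rgb = bytes([255, 0, 0, 0, 0, 255])
mask = bytes([255, 0])
rgba = merge_rgb_and_mask(rgb, mask, 2, 1)
print(list(rgba))
```

An extractor that writes only the RGB half produces an opaque image, which is one way to end up with a solid background where the transparency should be.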

An alternative library is MuPDF

MuPDF\1.20.0-tesseract>mutool extract -a tt.pdf
extracting image-0019.png
extracting image-0020.png
extracting image-0027.png

The simplest CLI extractor is thus MuPDF, so PyMuPDF should be able to export a similar blend. (However, since I use single-line programs without the overheads Python imposes, I can't confirm the code needed in this specific case.)

As a side note, a different image extractor that should also maintain transparency gave a different view of the contents: it extracted the transparent image plus a smaller thumbnail. Looking at the producer (Adobe Illustrator) explains much of the difference in how the graphics are reported.

pdfcpu_0.3.12>pdfcpu  images  list tt.pdf
pages: all
2 images available
page  obj#  id  type width height colorspace comp bpc interp size filters
=========================================================================
   1    19 Im0 smask   950    575    Indexed    3   8        2 KB FlateDecode
        13     thumb   105     57    Indexed    3   8         171 ASCII85Decode,FlateDecode

         <xmp:Thumbnails>
            <rdf:Alt>
               <rdf:li rdf:parseType="Resource">
                  <xmpGImg:width>256</xmpGImg:width>
                  <xmpGImg:height>156</xmpGImg:height>
                  <xmpGImg:format>JPEG</xmpGImg:format>
...
<photoshop:LayerText>Testing</photoshop:LayerText>
...
<rdf:li>Cyan</rdf:li>
<rdf:li>Magenta</rdf:li>
<rdf:li>Yellow</rdf:li>
<rdf:li>Black</rdf:li>
...
<</BC 23 0 R/G 24 0 R/S/Luminosity/Type/Mask>>

In fact, there were several references to masks!

Upvotes: 4
