Reputation: 31

Extracting images from PDF using pypdfium2 (Python)

I am trying to extract images from a PDF document using this specific library: pypdfium2 (https://pypi.org/project/pypdfium2/).

I would love to use PyMuPDF instead (given its excellent speed and versatility), but because it uses a copyleft license I CANNOT use it in my workflow. So please don't post an answer that advises me to use PyMuPDF.

Any suggestions are appreciated. I've looked through the docs but can't seem to find any image extraction methods.

To be clear, I am not trying to convert the PDF pages into images; I am trying to extract the images embedded in the document itself (assuming there are any). Images are typically embedded as either JPEGs or PNGs.

Upvotes: 3

Answers (3)

K J

Reputation: 11727

A PDF generally stores images in one of two ways. The first is to take the raw image and embed it. These are usually JPEGs and keep the compression they came with. There are several storage methods, such as inline and indirect, but the point is that the data is stored "as inserted".

Their compression and quality therefore do not change unless they are extracted, re-compressed and re-inserted. Many people ask why PDF images cannot simply be compressed in place. It is possible, but tricky.

The second way is that the RGB, grey or mono components are inserted as bitmaps (of one type or another), and for PNGs (or other images with alpha transparency) a second image is added as a soft mask. That makes two images per insertion, which are even harder to handle.

So easy Free and Open Source Software (FOSS) solutions are hard to come by:

$ pdfimages <PDF file> -list (pdfimages command ref)
This gives you clues about some of the structures and extracts what it can (not everything).

e.g.

--0000.ppm: page=1 width=1800 height=682 hdpi=599.67 vdpi=599.12 colorspace=DeviceRGB bpc=8
--0001.ppm: page=3 width=1834 height=665 hdpi=345.93 vdpi=345.75 colorspace=DeviceRGB bpc=8
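If you want to act on that listing from Python, a rough sketch like the one below can turn each line into a dictionary. It only assumes the `name: key=value ...` layout shown above; the function name `parse_listing_line` is made up for illustration, and other pdfimages builds may print the listing in a different layout.

```python
import re

# Matches "<file name>: <key=value tokens...>" as in the listing above.
LINE_RE = re.compile(r"^(?P<name>\S+):\s+(?P<fields>.*)$")


def parse_listing_line(line):
    """Parse one listing line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = {"file": match.group("name")}
    for token in match.group("fields").split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            record[key] = value
    # Convert the numeric fields; everything else stays a string.
    for key in ("page", "width", "height", "bpc"):
        if key in record:
            record[key] = int(record[key])
    for key in ("hdpi", "vdpi"):
        if key in record:
            record[key] = float(record[key])
    return record


sample = ("--0000.ppm: page=1 width=1800 height=682 hdpi=599.67 "
          "vdpi=599.12 colorspace=DeviceRGB bpc=8")
info = parse_listing_line(sample)
print(info["page"], info["width"], info["height"], info["colorspace"])
```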

So what images are those? The first has 22 colors, all near black or near white, so it is greyscale but almost monochrome in nature and could be converted externally to 600 dpi black and white!

The second is a screenshot from Amazon showing an iPhone, so it has a high proportion of orange and black with some red and blue too. It can be converted to a JPEG or a PNG (without alpha) at 346 dpi, whichever you wish!

And so on. In this case most of the images are better candidates for lossless PNG; only the second one would be best output as a JPEG.
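One crude way to automate that choice is to count distinct colors: few colors suggests flat artwork (PNG), many suggests photo-like content (JPEG). The sketch below is only an illustration of that idea. The function name `suggest_format` and the cut-off of 256 colors are my own assumptions, not something any PDF tool prescribes, and a busy screenshot can easily land on either side of the threshold.

```python
def suggest_format(pixels, max_flat_colors=256):
    """Return "png" for images with few distinct colors, "jpeg" otherwise.

    pixels is any iterable of hashable pixel values, e.g. (r, g, b) tuples.
    max_flat_colors is an arbitrary, assumed threshold.
    """
    seen = set()
    for pixel in pixels:
        seen.add(pixel)
        if len(seen) > max_flat_colors:
            # Too many distinct colors: treat as photo-like content.
            return "jpeg"
    return "png"


# A scan with 22 shades of near black and near white stays below the threshold.
scan_like = [(shade, shade, shade) for shade in range(22)] * 100
print(suggest_format(scan_like))
```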

Basically, when reversing raw PDF image data, deciding what format to output is not simple.

Untested

Please also try pypdfium2 extract-images --help to see what its built-in options are (I understand from the docs that --render should help).

Upvotes: 0

user3435121

Reputation: 655

You can use pdfimages, a command line tool (part of poppler-utils, which is often preinstalled on Linux).
It is efficient, supports six image formats, and can convert all of them to PNG if you need uniformity. Also see the helpful example scripts posted by the user pl1nk.
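If you need to drive it from Python, something along these lines should work. This is a sketch that assumes the Poppler build of pdfimages is on your PATH, where `-png` writes PNG files and `-all` keeps the embedded formats where it can; the xpdf build uses different options. The helper names are hypothetical.

```python
import subprocess


def pdfimages_command(pdf_path, prefix, as_png=True):
    """Build the argument list for Poppler's pdfimages (assumed to be installed)."""
    # -png: write images as PNG; -all: keep the embedded format where possible.
    flag = "-png" if as_png else "-all"
    return ["pdfimages", flag, str(pdf_path), str(prefix)]


def run_pdfimages(pdf_path, prefix, as_png=True):
    """Run pdfimages and raise CalledProcessError if it exits with an error."""
    subprocess.run(pdfimages_command(pdf_path, prefix, as_png), check=True)


print(pdfimages_command("input.pdf", "out/img"))
```

Calling run_pdfimages("input.pdf", "out/img") would then write files whose names start with out/img, provided the out directory already exists.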

Upvotes: 0

mara004

Reputation: 2342

pypdfium2 maintainer here - I found this thread by chance. Yes, this is possible, and also documented. Take a look at PdfPage.get_objects() and PdfImage.extract() (or PdfImage.get_bitmap()).
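A minimal sketch of how those calls might fit together, assuming a recent (v4-era) pypdfium2 release where get_objects() accepts a filter list and extract() takes a file prefix and chooses the extension itself. The helper names and the file naming scheme are made up for illustration; check the API reference of your installed version before relying on it.

```python
from pathlib import Path


def image_prefix(out_dir, page_index, image_index):
    """Build a file prefix such as out/page0001_img0002 (naming is illustrative)."""
    return Path(out_dir) / f"page{page_index + 1:04d}_img{image_index + 1:04d}"


def extract_images(pdf_path, out_dir):
    """Write every image object found in pdf_path below out_dir; return the count."""
    import pypdfium2 as pdfium  # third-party: pip install pypdfium2

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pdf = pdfium.PdfDocument(pdf_path)
    count = 0
    for page_index in range(len(pdf)):
        page = pdf[page_index]
        # Restrict the object walk to image objects only.
        images = page.get_objects(filter=[pdfium.raw.FPDF_PAGEOBJ_IMAGE])
        for image_index, image in enumerate(images):
            # extract() is given a prefix and appends the file extension itself.
            image.extract(str(image_prefix(out_dir, page_index, image_index)))
            count += 1
    return count


print(image_prefix("out", 0, 1))
```

Swapping extract() for get_bitmap() would give you a bitmap object instead of a file, which is useful if you want to post-process the pixels yourself.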

There's also a built-in CLI command, pypdfium2 extract-images, intended as a testing utility. Its implementation demonstrates how to use the above APIs.

However, due to limitations in pdfium's public interface, pypdfium2 is nowhere near as good at image extraction as would technically be possible. You may want to consider pikepdf (MPL2-licensed); it's the best and most sophisticated tool for this task IMO.

(BTW, it's better to ask such questions on pypdfium2's discussions page on GitHub; you're more likely to get a dev response there.)

Upvotes: 3
