ykaganovich

Reputation: 14964

Lossless image extraction from PDF

I'm using PDFBox to extract images from a PDF file and feed them to another image processing library (one that can handle different image formats). My current code looks like this:

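// pdImage is an image XObject obtained elsewhere (e.g. from a page's resources)
PDImageXObject pdImage;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
// Decode the embedded image into a BufferedImage, then re-encode it as PNG
BufferedImage image = pdImage.getImage();
ImageIO.write(image, "png", baos);
byte[] imageBytes = baos.toByteArray();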

This takes whatever is stored in the PDF file and uses Java graphics to convert it to PNG. Is there a better way that avoids the conversion and extracts the image in whatever format it is embedded in? I don't want to degrade image quality (which I suppose is mitigated by using a lossless format like PNG?) or incur conversion overhead.

Upvotes: 0

Views: 624

Answers (1)

JosephA

Reputation: 1215

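The DEFLATE algorithm is used both by the FlateDecode filter and by the PNG file format. However, a stream of FlateDecode-compressed data isn't itself a PNG file: a PNG file wraps its compressed data in a signature and chunks (such as the IHDR header), so the raw stream can't simply be saved with a .png extension.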
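
To make the difference concrete, here is a minimal sketch using only the JDK (no PDFBox). It compresses some raw sample bytes with java.util.zip.Deflater, which is roughly the kind of data a FlateDecode stream carries, and compares the result with a real PNG written by ImageIO. The class and method names (DeflateVsPng, hasPngSignature, deflate, encodePng) are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import javax.imageio.ImageIO;

class DeflateVsPng {
    // The fixed 8-byte signature that starts every PNG file.
    static final byte[] PNG_SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    // Returns true if the data begins with the PNG signature.
    static boolean hasPngSignature(byte[] data) {
        if (data.length < PNG_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < PNG_SIGNATURE.length; i++) {
            if (data[i] != PNG_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    // Compresses raw bytes with DEFLATE in the zlib wrapper (Deflater's default).
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Encodes a blank RGB image of the given size as a real PNG file in memory.
    static byte[] encodePng(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] rawSamples = new byte[4 * 4 * 3]; // 4x4 pixels, 3 bytes each, all zero
        System.out.println("deflated data starts like a PNG: " + hasPngSignature(deflate(rawSamples)));
        System.out.println("ImageIO output starts like a PNG: " + hasPngSignature(encodePng(4, 4)));
    }
}
```

The compressed bytes are the same kind of payload in both cases, but only the ImageIO output carries the container that makes it a PNG file.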

Also, you need to consider the colorspace of the Image XObject (e.g. DeviceCMYK) versus what PNG actually supports. PNG has no CMYK mode, so such an image has to be converted on the way out.
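
As a rough illustration of why a CMYK image can't just be copied into a PNG, here is a sketch of the simplest possible CMYK to RGB conversion for a single sample. This is the naive, profile-free formula, not what a colour-managed PDF library would necessarily do (real conversions normally go through ICC profiles), and the class and method names (NaiveCmykToRgb, toRgb) are hypothetical.

```java
class NaiveCmykToRgb {
    // Scales one ink channel against black: 255 * (1 - ink/255) * (1 - k/255), rounded.
    private static int channel(int ink, int k) {
        return ((255 - ink) * (255 - k) + 127) / 255;
    }

    // Converts one CMYK sample (each component in 0..255) to a packed 0xRRGGBB value.
    // Assumption: no colour profile is involved, so the result is only an approximation.
    static int toRgb(int c, int m, int y, int k) {
        int r = channel(c, k);
        int g = channel(m, k);
        int b = channel(y, k);
        return (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        System.out.printf("no ink      -> %06X%n", toRgb(0, 0, 0, 0));
        System.out.printf("full cyan   -> %06X%n", toRgb(255, 0, 0, 0));
        System.out.printf("full black  -> %06X%n", toRgb(0, 0, 0, 255));
    }
}
```

Four stored components become three, so some conversion step is unavoidable whenever the target format is PNG.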

By targeting a lossless format for your output image file, you won't lose any information. (Be sure you actually need a lossless extracted image. People often assume that lossy compression will change an image so much that it is no longer recognizable, but in many cases, depending on the parameters, the loss is hardly noticeable to the naked eye, and you can benefit substantially from the size savings of lossy compression.)
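
If you want to convince yourself of this, a small JDK-only sketch like the following encodes a synthetic RGB image, decodes it again, and counts the pixels that changed. The names (PngRoundTrip, sampleImage, roundTrip, differingPixels) are made up for this example, and the test image is synthetic rather than something extracted from a PDF.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class PngRoundTrip {
    // Builds a small RGB test image whose colour varies from pixel to pixel.
    static BufferedImage sampleImage(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int r = x * 37 % 256;
                int g = y * 59 % 256;
                int b = (x + y) * 11 % 256;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }

    // Encodes the image with the named ImageIO format and decodes the result again.
    static BufferedImage roundTrip(BufferedImage image, String format) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(image, format, out)) {
                throw new IOException("no ImageIO writer for format " + format);
            }
            return ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts pixels whose value differs between two images of equal size.
    static int differingPixels(BufferedImage a, BufferedImage b) {
        int count = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        BufferedImage original = sampleImage(32, 32);
        System.out.println("pixels changed by PNG:  " + differingPixels(original, roundTrip(original, "png")));
        System.out.println("pixels changed by JPEG: " + differingPixels(original, roundTrip(original, "jpg")));
    }
}
```

The PNG count should be zero, while the JPEG count will typically be nonzero; how visible that difference is depends on the image and the quality setting.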

If extraction is slow, the cause may simply be the quality of the PDF software that extracts and saves the image.

Upvotes: 2
