Extract image into a file from PDImageXObject without loading it into memory

Question

This is related to "How to extract image bytes out of PDF efficiently", but I'll try to restate the problem so that it's less about PDF parsing and more about image processing.

I'm using PDFBox to extract images from PDF files. There's a class, PDImageXObject, that represents an image inside the PDF. It contains image metadata (height, width, etc.) and exposes two methods for pulling out the image: BufferedImage getImage() and BufferedImage getImage(Rectangle rect, int subsampling).

The current code is straightforward:

BufferedImage image = pdImage.getImage();
ImageIO.write(image, "jpg", baos);

However, for a large image I'm running into memory problems, because BufferedImage stores the uncompressed image data in memory, which is much larger than the compressed result.
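To put a rough number on this, here is a minimal sketch of the arithmetic. The class and method names (RasterSize, uncompressedBytes) are made up for illustration, and it assumes a packed raster at 3 bytes per pixel (8-bit RGB, no alpha); the actual layout of a BufferedImage depends on its type and may use more.

```java
final class RasterSize {
    // Bytes needed to hold an uncompressed raster, assuming a fixed
    // number of bytes per pixel (an assumption; real layouts vary).
    static long uncompressedBytes(int width, int height, int bytesPerPixel) {
        // Cast first so the multiplication is done in long and cannot overflow int.
        return (long) width * height * bytesPerPixel;
    }

    public static void main(String[] args) {
        long bytes = uncompressedBytes(10_000, 10_000, 3);
        System.out.println(bytes + " bytes, about " + bytes / (1024 * 1024) + " MiB");
    }
}
```

Under these assumptions a 10,000 x 10,000 image needs 300,000,000 bytes (about 286 MiB) once decoded, regardless of how small the compressed stream in the PDF is.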

Is there a way to avoid loading the whole image into memory by breaking it up into tiles (e.g. 1024x1024) and iterating over them using the getImage signature that takes Rectangle? I'm seeing some promising information about JAI being able to use Tiles to output a compressed image without loading the uncompressed content into memory at once, but I don't understand how to tie it together with what I have from PDImageXObject. Or is there another way to do it? Is JAI still an active project?
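The tiling idea can be sketched as follows. This only computes the tile rectangles and uses nothing beyond the JDK. TilePlanner and tiles are hypothetical names, and the example dimensions are arbitrary. With PDFBox, each rectangle would presumably be passed to pdImage.getImage(tile, 1), on the assumption that a subsampling value of 1 means no subsampling. Whether PDFBox still decodes the full image stream internally for each call is not something this sketch establishes.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.List;

final class TilePlanner {
    // Splits a width x height image into tiles of at most tileSize x tileSize,
    // row by row. Tiles on the right and bottom edges are clipped to the image.
    static List<Rectangle> tiles(int width, int height, int tileSize) {
        if (width <= 0 || height <= 0 || tileSize <= 0) {
            throw new IllegalArgumentException("width, height and tileSize must be positive");
        }
        List<Rectangle> result = new ArrayList<>();
        for (int y = 0; y < height; y += tileSize) {
            int h = Math.min(tileSize, height - y);
            for (int x = 0; x < width; x += tileSize) {
                int w = Math.min(tileSize, width - x);
                result.add(new Rectangle(x, y, w, h));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In the real code the dimensions would come from pdImage.getWidth()
        // and pdImage.getHeight(); the values here are placeholders.
        for (Rectangle tile : tiles(2500, 1500, 1024)) {
            // Assumed usage: BufferedImage part = pdImage.getImage(tile, 1);
            System.out.println(tile);
        }
    }
}
```

Each decoded tile is then at most 1024 x 1024 pixels, so the open question becomes how to hand the tiles to an encoder that can write them out one at a time.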

By the way, the purpose of extracting the image is to feed it into the next component in the pipeline, which can handle multiple image formats. So if some format other than JPEG is better suited to tiled processing, that would be fine.

I'm aware of one possibility, using something like BigBufferedImage, but processing one tile at a time looked more promising.

