ykaganovich

Reputation: 14964

Image size in PDF

I have some code that extracts images from PDF documents. I want to skip images that are too big, but I don't know how to tell whether an image is too big. I tried using PdfImageXObject.getCOSObject().getLength(), but in my test it appears to return a much bigger value than the image size on the file system. How do I find out, at least approximately, how big the image is, in bytes, without actually extracting it (an expensive operation)?

Upvotes: 0

Views: 572

Answers (1)

David van Driessche

Reputation: 7048

getLength() is not a good measure as it returns the encoded length of the stream. Depending on the encoding used in the PDF file, and the encoding you use on the file system, you'll end up with either a smaller or larger value.

  • If the image in the PDF is JPEG-encoded and you save it without compression, getLength() will be much smaller than the size on the file system.
  • If the image in the PDF is not encoded and you save it as a JPEG, the file on disk will be much smaller than getLength().

A more reliable way to do this is to look at the width and height of the image, which PDImage exposes through getWidth() and getHeight(). These give you the number of pixels horizontally and vertically.

This will not be exactly correct. If you want the total byte size of an image, you would also have to look at the color space to see how many components there are per pixel (3 for RGB, 4 for CMYK, for example) and how many bits per component the image uses. But you can probably skip those values for your purpose and just make do with the width and height to get a rough indication of whether you want to save the image or not.
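For illustration, here is a minimal sketch of that arithmetic in plain Java. The class and method names (ImageSizeEstimate, estimateRawBytes, isTooBig) are made up for this example and are not part of any library. I am assuming you would feed it values read from the image header, for instance PDImage's getWidth(), getHeight() and getBitsPerComponent(), plus the component count of its color space; check those calls against the PDFBox version you use. The result is the approximate uncompressed size, not the size of whatever file you end up writing.

```java
// Rough estimate of an image's uncompressed size, computed from header values only.
// Hypothetical helper for illustration; not part of PDFBox or any other library.
final class ImageSizeEstimate {

    // width, height: pixel dimensions of the image
    // components: color components per pixel (1 for gray, 3 for RGB, 4 for CMYK)
    // bitsPerComponent: typically 1, 2, 4, 8 or 16
    static long estimateRawBytes(int width, int height, int components, int bitsPerComponent) {
        if (width <= 0 || height <= 0 || components <= 0 || bitsPerComponent <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Use long arithmetic so large images do not overflow int.
        long bitsPerRow = (long) width * components * bitsPerComponent;
        // Each row of sample data is padded up to a whole number of bytes.
        long bytesPerRow = (bitsPerRow + 7) / 8;
        return bytesPerRow * height;
    }

    // True when the estimated uncompressed size exceeds the caller's limit.
    static boolean isTooBig(int width, int height, int components,
                            int bitsPerComponent, long maxBytes) {
        return estimateRawBytes(width, height, components, bitsPerComponent) > maxBytes;
    }
}
```

If you only have the width and height, passing 3 components and 8 bits per component is a reasonable default guess for a quick filter, since it only has to separate "small" from "huge".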

Upvotes: 3
