Andrew Still

Reputation: 57

How to inflate (decompress) an image from a PDF using Node.js?

I have an image downloaded from the internet, and a PDF created from the same image using Chrome's print page function.

When I compress and then decompress the same image, everything works well:

  const zlib = require('node:zlib');
  const fs = require('node:fs');

  const deflate = zlib.createDeflate();
  const inp = fs.createReadStream('output.png');
  const out = fs.createWriteStream('encoded.txt');
  inp.pipe(deflate).pipe(out);
  // file is deflated, great

  const inflate = zlib.createInflate();
  const input = fs.createReadStream('encoded.txt');
  const output = fs.createWriteStream('decoded.png');
  input.pipe(inflate).pipe(output);
  // file is the same exact image as expected

But when I do the same with the PDF image it doesn't work, throwing only an unhelpful Uncaught Error: Zlib error and nothing more. I suspect the image data is malformed, or that some metadata from the PDF needs to be included in the image somehow, but I'm not sure what the exact reason is.

One more detail: I'm trimming all the lines before and after the start and end of the image stream, so only this image data is read.

Here are the lines immediately before the image stream object in the PDF:

4 0 obj
<</Type /XObject
/Subtype /Image
/Width 1500
/Height 970
/ColorSpace [/ICCBased 5 0 R]
/BitsPerComponent 8
/Filter /FlateDecode
/Length 320983>> stream

Can someone suggest what I'm missing here?

Link to the PDF file; in general I think it should work in a similar way for any other PDF created this way.

Thanks

random image from the internet

Upvotes: 0

Views: 222

Answers (2)

K J

Reputation: 11867

The specific start of the question is how to decompress (inflate) a PDF /Filter /FlateDecode stream, which needs to be run through a zlib filter as input or output. Depending on the tools, this can be as simple as running

stream.bin | zlib-flate -uncompress >stream.map to get the raw uncompressed pixels.

On Windows you would use type in CMD as the driver and a filter from, say, qpdf. There are 4 sources for such a tool, but it does not matter which you use: the headless uncompressed pixel-map output is always the same.
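
The same decompression step can also be done directly in Node.js (the OP's environment) with the built-in zlib module. A minimal sketch, assuming the raw /FlateDecode bytes (everything between stream and endstream) have already been saved to a hypothetical stream.bin:

const fs = require('node:fs');
const zlib = require('node:zlib');

// /FlateDecode data is zlib-wrapped, so a plain inflate (not inflateRaw or gunzip) applies.
const deflated = fs.readFileSync('stream.bin');
const pixels = zlib.inflateSync(deflated);
fs.writeFileSync('stream.map', pixels);
console.log('inflated', deflated.length, '->', pixels.length, 'bytes');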

Recompression means replacing the uncompress switch with -compress=# where # (0-9) is how much effort (and time) is spent compressing. -compress=0 is the fastest (roughly no delay), as all it does is add a zlib header to the data.


for /l %c in (0,1,9) do type stream.map | zlib-flate -compress=%c >stream.%c

Each higher level may incrementally work the size downwards. Note the source (.bin) is already sitting optimally on that scale, between levels 6 and 7; each further step is "diminishing returns".


There are many ways to store an image in a PDF; however, at the most basic level there are embedded native files via /DCTDecode, or a reference URI to their external image data streams (e.g. v:/Public/imagestream.jpg). The latter is to be avoided as a security issue, and is thus only used in closed systems with a ./local file.

4 0 obj <</BitsPerComponent 8
/ColorSpace/DeviceRGB/F(imagestream.jpg)/FFilter/DCTDecode
/Height 2736/Length 1/Subtype/Image/Type/XObject/Width 3648>>
stream
endstream
endobj

The alternative basic method is pixel bitmaps, which, as in this question, use /FlateDecode compression, since PDFs cannot natively support embedded PNG or many other modern file formats, except a few JPEG or TIFF variants.

Unsupported native image types like PNG are converted on import into one or two compressed raw pixel maps that have no metadata for re-formatting on extraction.

The image stored in the PDF has no density (DPI) and is only abstractly described as bits per component times /Width times /Height, with /Length giving the size of the compressed data.

Let us use an extremely minimal example: this would be the PDF entry for a page full of 9 uncompressed pixels from a PNG. (There is no DPI involved inside a PDF, so all pixels can be any size or shape of rectangle.)

7 0 obj<</Length 27/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB>>
stream
ÿ   ÿ   ÿ ÿÿÿ ÿÿÿ    ÿÿÿ   
endstream
endobj
5 0 obj <</Length 4/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray>>
stream
ÿÿÿ
endstream
endobj

Points to note are that the alpha channel (the /SMask) is independent of the RGB colours, and the two objects don't have to be in any particular numeric order or location in the PDF.

Ignoring the alpha, how can compression be used to reduce the 27-byte length of the pixels? The PDF method is to apply any acceptable compression /Filter as the encoding.

Common filters are JPEG (/DCTDecode), zlib deflate (/FlateDecode), and hex (/ASCIIHexDecode), etc.
Clearly hex is not helpful except for analysis.

7 0 obj <</Length 55/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB/Filter/ASCIIHexDecode>>
stream
ff000000ff000000ff00ffffff00ffffff00000000ffffff000000>
endstream
endobj
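
As an aside, that hex payload is just the 27 pixel bytes hex-encoded, with ASCIIHexDecode's > end-of-data marker appended; a quick Node.js illustration:

// The 9 RGB pixels (27 bytes) from the example above.
const pixels = Buffer.from(
  'ff000000ff000000ff00ffffff00ffffff00000000ffffff000000', 'hex');

const asciiHex = pixels.toString('hex') + '>';
console.log(asciiHex.length); // 55, matching /Length 55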

If we use /FlateDecode it will be minimal.

So uncompressed the image data is 27 bytes; flate level 0 adds 11 bytes, level 1 only manages a slow reduction of 3 bytes, and thus the best lies somewhere between levels 3 and 9. Most libraries optimise for 6 as a good all-round universal value.

bytes
  27 colour.map

  38 colour.0
  24 colour.1
  24 colour.2
  22 colour.3
... default 6
  22 colour.9

Here we see the optimum is 22 bytes (it cannot be less at this size):

7 0 obj
<</Length 22/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB/Filter/FlateDecode>>
stream
xœûÏÀÀðŒÿÿ‡`L §sõ
endstream
endobj

The point to note here is that the /Flate stream introduces minimal overhead by virtue of its header data. However, the length has been significantly decreased for these 9 pixels, and if we removed any colour it would be even shorter.
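
The level sweep above can be reproduced with Node's built-in zlib (the same underlying library); a minimal sketch on the 27 uncompressed bytes, where the exact counts may differ by a byte or two depending on the zlib build:

const zlib = require('node:zlib');

// The 27 uncompressed RGB bytes of the 3x3 example (the ASCIIHexDecode payload minus the '>').
const pixels = Buffer.from(
  'ff000000ff000000ff00ffffff00ffffff00000000ffffff000000', 'hex');

for (let level = 0; level <= 9; level++) {
  const size = zlib.deflateSync(pixels, { level }).length;
  console.log(`level ${level}: ${size} bytes`); // should roughly match the colour.0 .. colour.9 table
}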

To inflate, you multiply bytes per pixel by width by height to get the extracted size, so in the OP's case that is 3 x 1500 x 970 = 4,365,000 bytes (4.2 MB) as a headless pixel map, which can then be saved in any file format you wish, for example with a PNG library.
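
In Node.js terms, that is a one-line sanity check on the inflated buffer (pixels here stands for the hypothetical inflated stream from the earlier sketch):

const expected = 3 * 1500 * 970; // bytes per pixel x width x height = 4,365,000
if (pixels.length !== expected) {
  throw new Error(`unexpected pixel map size: ${pixels.length} != ${expected}`);
}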


A simpler method is to use any command line tool that already includes such code; Poppler's pdfimages has several output formats, and -all will try the best match.

The output in this case will be a compressed PNG of 570,505 bytes.

Alternatively, mutool extract will produce a more compressed PNG of 328,335 bytes.

Other libraries for extracting PDF pixel maps as PNG are available. It is possible to squish the image a few percent more using a range of optimisers, but it takes time to test each one to find the best for a given image. I can losslessly drop that image down to 297,323 bytes.

Whatever image you extract can be altered as a PNG via many different compressors, so with the best of those the PNG can be optimally and losslessly compressed down to 292,562 bytes; but the file overheads increase, so overall it ends up at 298,101 bytes. This is because not all compression schemes are acceptable in PDF, so in this case the image needs converting on import back to /Filter /FlateDecode with /Length 297323: smaller, but at best by less than 10%, for a lot of processing time.


If we do not degrade the image and convert the 328,335-byte PNG into a new PDF, it will be flated by PDFium to 321 KB (328,894 bytes); and if we do the same with the 4.2 MB raw pixel map, PDFium again produces 321 KB (328,894 bytes).

So it does not matter what an image is externally: "within a PDF" it should generally already be, as you already have it, losslessly the smallest possible within that PDF. We can alter other parts of the file, but not that image, unless we alter quality by degrading the image density or gamut.

To make a PDF image smaller there are 2 basic methods: either reduce the colours (perfect for scanned text pages or flat-colour diagrams), which maintains resolution, or remove pixels, ideally as an integer 50% rescale. Otherwise there is conversion to JPEG.


Converting to JPEG is a compromise between those 2, so it is significantly smaller at 76.1 KB (78,005 bytes).


The less compression you use, the closer you stay to the original quality.


At 100% quality (near lossless), the JPEG, when flated into a PDF, produces 621 KB (636,127 bytes):

DeviceRGB/Filter/FlateDecode/Height 970/Length 635110/Subtype/Image/Type/XObject/Width 1500>>stream

Thus the existing /FlateDecode stream of the 4.2 MB headless pixel map (no format or metadata) is the winner in terms of size while maintaining quality.

Answer

In comments it was stated that the missing "aim" is to inflate, alter, and replace the image. This can be tested manually (Notepad++ or another binary editor) if you know how to test each flated form for PDF compatibility, but it is not very simple from the command line.

I want to replace images in PDF with the same ones but compress them using libvips or ImageMagick.

Thus, without destroying the image (lossless replacement), we will see the result will often be bigger when using ImageMagick etc.

There are several cross-platform tools on the market that will attempt such a task, often using Ghostscript and/or ImageMagick, or a variety of zlib deflate compressors. Generally they will return the same size or a larger PDF.

Let us look at that workflow. The source PDF is 323,666 bytes. The decompressor (/Filter /FlateDecode, as seen above) will produce a pixel image with a minimalistic header of 4,365,016 bytes, and ImageMagick will losslessly condense that down to a PNG of different sizes: say 118,914 bytes at 80% scale, or 216,842 or 292,541 bytes at 100% scale, depending on compression. That will be returned to be reconverted into a PDF-compatible /Filter /FlateDecode stream in the PDF, and the file will become a total of 334,532 bytes. Thus larger.

Other compression commands will shave a tad off in other areas of the file, but so as not to degrade the image(s), many will often increase that already optimal example. Some will remove a lot more of the PDF structure, and we can reduce the file further over a longer time using file comparison.

323,843 bytes  cpdf -compress -squeeze-no-pagedata "%filename%" -o "cpdf-c-n.pdf"
322,783 bytes  (slower) cpdf -compress -squeeze "%filename%" -o "cpdf-c-s.pdf"
322,698 bytes  (at times can be very slow) pdfsizeopt.exe in.pdf (in.pso.pdf)
330,699 bytes  (slow) reflate via qpdf --recompress-flate --optimize-images
299,253 bytes  (even slower) Flate compression to the absolute level-9 limit

There is really only one reasonable way to decrease the size significantly, given a compressed flate stream that flips from pixel to pixel, and that is to reduce the number of colour changes by reducing the colours.

So here the colours have been reduced to 120 and the compressed size is also naturally reduced. The fewer colour changes there are, the smaller the flate stream will be.
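
A quick way to see that effect with Node's zlib is to flate the same number of pixel bytes once with a colour change at every position and once as a single flat colour (a toy illustration, not the OP's actual image data):

const zlib = require('node:zlib');
const crypto = require('node:crypto');

const size = 1500 * 970 * 3; // the OP's pixel count at 3 bytes per pixel

// Random bytes change "colour" at every byte; a flat fill never changes.
const noisy = crypto.randomBytes(size);
const flat = Buffer.alloc(size, 0x80);

console.log('noisy:', zlib.deflateSync(noisy).length, 'bytes'); // roughly the full ~4.2 MB (incompressible)
console.log('flat:', zlib.deflateSync(flat).length, 'bytes');   // only a few KB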


Reducing the pixel count (rescaling to, say, 80%) usually increases the number of colours, so it also needs colour reduction to counteract that colour interpolation. However there is a loss of clarity, even if the result is smaller.


Upvotes: 1

AKX

Reputation: 169338

The inflated bits you get are a raw 1500x970 array of bytes – we just happen to know there's 3 per pixel for this example, but more robust code should really parse the /ColorSpace definition.

Note that this is very non-robust toy code, and you really should be parsing the PDF stream properly instead of yolo'ing to find headers in the buffer.

It's also shelling out to ImageMagick to do the raw-to-PNG conversion (without validation of outName being safe), but you could probably use e.g. sharp or whatnot to do that within Node.

import fsp from "node:fs/promises";
import zlib from "node:zlib";
import { execSync } from "node:child_process";

function findNextXObject(buf, offset) {
  const xObjectPos = buf.indexOf("<</Type /XObject", offset);
  if (xObjectPos === -1) {
    return null;
  }
  const streamStartPos = buf.indexOf("stream", xObjectPos);
  if (streamStartPos === -1) {
    return null;
  }
  const streamEndPos = buf.indexOf("endstream", streamStartPos);
  if (streamEndPos === -1) {
    return null;
  }
  return { xObjectPos, streamStartPos, streamEndPos };
}

async function convertRawXImageToPNG(headerString, content, outName) {
  const width = parseInt(headerString.match(/\/Width (\d+)/)?.[1], 10);
  const height = parseInt(headerString.match(/\/Height (\d+)/)?.[1], 10);
  const bpc = parseInt(headerString.match(/\/BitsPerComponent (\d+)/)?.[1], 10);
  if (!(width && height && bpc)) {
    console.log("Could not parse width, height, or bits per component from header string:", headerString);
    return;
  }
  const colorspace = `rgb`; // TODO: this should be read from the `/ColorSpace` entry in the headerString
  // Shell out to ImageMagick to convert the raw image to PNG; this could
  // probably be done with a native Node.js module.
  // TODO: this should use better sanitization!
  execSync(`magick -depth ${bpc} -size ${width}x${height} ${colorspace}:- ${outName}`, {
    input: content,
  });
}

async function main() {
  const bits = await fsp.readFile("hero.pdf");
  let offset = 0;
  while (true) {
    const res = findNextXObject(bits, offset);
    if (!res) {
      break;
    }
    const { xObjectPos, streamStartPos, streamEndPos } = res;
    const headerString = bits.subarray(xObjectPos, streamStartPos).toString();
    console.log("Found XObject:", headerString);
    let content = bits.subarray(streamStartPos + 7, streamEndPos);
    if (headerString.includes("/Filter /FlateDecode")) {
      content = zlib.inflateSync(content);
    }
    let fn = `temp-${xObjectPos}.bin`;
    await fsp.writeFile(fn, content);
    console.log("Wrote:", fn, "Length:", content.length);
    if (headerString.includes("/Subtype /Image")) {
      const fn = `temp-${xObjectPos}.png`;
      await convertRawXImageToPNG(headerString, content, fn);
      console.log("Wrote PNG:", fn);
    }
    offset = streamEndPos;
  }
}

main();

This outputs

Found XObject: <</Type /XObject
/Subtype /Image
/Width 1500
/Height 970
/ColorSpace [/ICCBased 5 0 R]
/BitsPerComponent 8
/Filter /FlateDecode
/Length 320983>>
Wrote: temp-875.bin Length: 4365000
Wrote PNG: temp-875.png

The temp-875.png is then the Airbnb screenshot we've come to expect, approximately 236 kilobytes. That's expected, since ImageMagick likely uses a better PNG encoding algorithm (and the raw deflated bytes can't use PNGs' inter-line filters etc.).
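
If you'd rather not shell out at all, the ImageMagick step could plausibly be swapped for sharp, as hinted above; a minimal sketch, assuming the data really is 8-bit with 3 channels and that the package is installed:

import sharp from "sharp";

// Wrap the headerless pixel buffer and let sharp write the PNG container,
// applying its own scanline filtering and zlib compression.
async function convertRawRGBToPNG(content, width, height, outName) {
  await sharp(content, { raw: { width, height, channels: 3 } })
    .png()
    .toFile(outName);
}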

In honesty, I'd use pdfimages from the Poppler project for this, instead of writing any PDF-mangling code by hand.

Upvotes: 1
