Cameron Simpson

Reputation: 25

decoding PDF: can I use PIL/Pillow to access the PNG predictor algorithm in order to reverse it for ingest to PIL?

I'm decoding PDF files using Python with reference to the 2008 spec: https://web.archive.org/web/20081203002256/https://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf particularly section 7.4.4.4.

Images are usually embedded in PDF as byte streams, and each stream is associated with a dictionary with information about the stream. For example, often the stream is a compressed form of the original data; such details are described by the Filter entry in the dictionary.
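For illustration, a parsed image stream dictionary carries entries roughly like this (a made-up sketch; keys and values vary between PDF producers):

# Hypothetical parsed image stream dictionary (the "context_dict" used in
# the code further down); real dictionaries vary between producers.
context_dict = {
    b'Subtype': b'Image',
    b'Width': 800,
    b'Height': 600,
    b'ColorSpace': b'DeviceRGB',
    b'BitsPerComponent': 8,
    b'Filter': b'FlateDecode',
    b'DecodeParms': {
        b'Predictor': 15, b'Colors': 3, b'BitsPerComponent': 8, b'Columns': 800,
    },
}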

When I've got a stream whose filter is FlateDecode, the data were compressed using deflate, and that is easily reversed with zlib.decompress. But to improve compression the original data may be preprocessed before deflation, for example by differencing adjacent bytes; when the data contain many similar values, the differences compress better. The preprocessing is identified by the Predictor entry in the dictionary.
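For the simple case of FlateDecode with no Predictor, decoding is a single zlib call; a minimal self-contained sketch:

import zlib

# Stand-in for the raw bytes between "stream" and "endstream"; deflate some
# sample data here so the sketch is self-contained.
stream_bytes = zlib.compress(b"\x00\x01\x02" * 100)

# With /Filter /FlateDecode and no /Predictor, one call undoes the compression.
decoded = zlib.decompress(stream_bytes)
assert decoded == b"\x00\x01\x02" * 100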

The Predictor value 15 means a PNG differencing algorithm was used; unfortunately the 2008 PDF document basically just says "PNG prediction (on encoding, PNG optimum)". Yay.

Can someone explain to me (a) exactly which PNG filter algorithm this means (with a reference to its specification) and (b) ideally point me at a library which will reverse it. Lacking the latter I'd have to reverse it in pure Python, which will be slow. Acceptably slow for my initial use case, and I guess I can write it as a C extension (much) later if my needs become more frequent.

Where I am at present: my image property method currently looks like this:

  @property
  def image(self):
    im = self._image
    if im is None:
      decoded_bs = self.decoded_payload
      print(".image: context_dict:")
      print(decoded_bs[:10])
      pprint(self.context_dict)
      # Pull decode parameters from the stream dictionary, falling back to
      # defaults implied by the colour space.
      decode_params = self.context_dict.get(b'DecodeParms', {})
      color_transform = decode_params.get(b'ColorTransform', 0)
      color_space = self.context_dict[b'ColorSpace']
      bits_per_component = decode_params.get(b'BitsPerComponent')
      if not bits_per_component:
        bits_per_component = {b'DeviceRGB': 8, b'DeviceGray': 8}[color_space]
      colors = decode_params.get(b'Colors')
      if not colors:
        colors = {b'DeviceRGB': 3, b'DeviceGray': 1}[color_space]
      mode_index = (color_space, bits_per_component, colors, color_transform)
      width = self.context_dict[b'Width']
      height = self.context_dict[b'Height']
      print("mode_index =", mode_index)
      PIL_mode = {
          (b'DeviceGray', 1, 1, 0): 'L',
          (b'DeviceGray', 8, 1, 0): 'L',
          (b'DeviceRGB', 8, 3, 0): 'RGB',
      }[mode_index]
      print(
          "Image.frombytes(%r,(%d,%d),%r)..."
          % (PIL_mode, width, height, decoded_bs[:32])
      )
      im = Image.frombytes(PIL_mode, (width, height), decoded_bs)
      # Debugging: show the (currently wrong) image and bail out.
      im.show()
      exit(1)
      self._image = im
    return im

This shows me an "edgy", skewed image because I'm decoding the difference data as colour data and treating the per-row filter bytes as pixel data, which skews each subsequent row slightly.

Upvotes: 0

Views: 340

Answers (2)

BlueSky

Reputation: 227

The esteemed Mark Adler, who happens to be the first author of the PNG specification, already answered part (a) of the question about the specification of the PNG filter algorithm.

I will answer part (b), how to reverse the filtering efficiently. Pillow already implements the reverse filtering as part of its PNG decoder, so the question is how to use it. My approach is as follows:

  1. Create PNG chunks for a minimal PNG file, using our image data from the PDF for the IDAT chunk.
  2. Write the chunks to a file-like object.
  3. Load the file-like object using Pillow, which automatically decompresses and filters the data.

Here is a demonstration that downloads a PDF and extracts three images from it.

import os, zlib, struct, urllib.request
from PIL import Image
from io import BytesIO

def decode_png_idat(idat_data, width, height, header=(8, 2, 0, 0, 0)):
    # header = (bit depth, colour type, compression, filter, interlace);
    # (8, 2, 0, 0, 0) is 8-bit truecolour (RGB), no interlacing.
    def write_chunk(chunk_type, chunk_data):
        # One PNG chunk: 4-byte big-endian length, chunk type, data, CRC-32.
        f.write(struct.pack(">I", len(chunk_data)))
        f.write(chunk_type)
        f.write(chunk_data)
        f.write(struct.pack(">I", zlib.crc32(chunk_type + chunk_data)))

    f = BytesIO()
    f.write(b"\x89PNG\r\n\x1a\n")  # PNG file signature
    write_chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, *header))
    write_chunk(b"IDAT", idat_data)
    write_chunk(b"IEND", b"")
    f.seek(0)
    return Image.open(f)

def main():
    # backup URL in case the other URL is down
    # https://web.archive.org/web/20211023132308/https://sjtrny.com/files/10.1109_DICTA.2012.6411686.pdf
    url = "https://sjtrny.com/files/10.1109_DICTA.2012.6411686.pdf"
    filename = url.split("/")[-1]

    # Download PDF for testing
    if not os.path.exists(filename):
        print(f"Downloading {url} to {filename}")
        urllib.request.urlretrieve(url, filename)

    # Read PDF bytes
    with open(filename, "rb") as f:
        data = f.read()

    # Three example images with hardcoded offsets into the PDF
    example_image_data = [
        data[2828574:2828574+13271],
        data[5275869:5275869+211014],
        data[6939898:6939898+840881],
    ]
    width = 800
    height = 563

    # Decode and show images
    for image_data in example_image_data:
        image = decode_png_idat(image_data, width, height)
        image.show()

if __name__ == "__main__":
    main()
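If the embedded image were 8-bit grayscale (DeviceGray, Colors 1) rather than RGB, the IHDR would presumably need PNG colour type 0 instead of 2, something like this (untested variant of the call above):

# Hypothetical variant: 8-bit grayscale image data, PNG colour type 0.
image = decode_png_idat(image_data, width, height, header=(8, 0, 0, 0, 0))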

Upvotes: 1

Mark Adler

Reputation: 112502

The predictor used for each row is given by the first byte in each row, if the "Predictor" parameter is 10 or more. In that case, the value of that parameter has no further meaning. It doesn't matter that it's 15, other than the fact that 15 >= 10.

You can find the filter types in the PNG specification, in the sections covering:

  - a, b, c, and x
  - PNG filter types
  - the Paeth predictor
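For illustration, reversing those five filters on 8-bit samples looks roughly like this in pure Python (an unoptimised sketch; bpp is bytes per complete pixel, e.g. 3 for 8-bit RGB, 1 for 8-bit grayscale):

def png_unfilter(data, width, height, bpp):
    # Reverse PNG per-row filtering for 8-bit samples.
    # bpp is bytes per complete pixel (3 for 8-bit RGB, 1 for 8-bit gray).
    row_bytes = width * bpp
    out = bytearray()
    prev = bytearray(row_bytes)   # row above; all zeros before the first row
    pos = 0
    for _ in range(height):
        ftype = data[pos]         # the per-row filter type byte
        row = bytearray(data[pos + 1:pos + 1 + row_bytes])
        pos += 1 + row_bytes
        for i in range(row_bytes):
            a = row[i - bpp] if i >= bpp else 0    # left (already reconstructed)
            b = prev[i]                            # up
            c = prev[i - bpp] if i >= bpp else 0   # upper left
            if ftype == 0:                         # None
                pass
            elif ftype == 1:                       # Sub
                row[i] = (row[i] + a) & 0xFF
            elif ftype == 2:                       # Up
                row[i] = (row[i] + b) & 0xFF
            elif ftype == 3:                       # Average
                row[i] = (row[i] + (a + b) // 2) & 0xFF
            elif ftype == 4:                       # Paeth
                p = a + b - c
                pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
                pred = a if pa <= pb and pa <= pc else (b if pb <= pc else c)
                row[i] = (row[i] + pred) & 0xFF
            else:
                raise ValueError("unknown PNG filter type %d" % ftype)
        out += row
        prev = row
    return bytes(out)

Given the zlib-decompressed stream and matching width and height, something like Image.frombytes(mode, (width, height), png_unfilter(decoded, width, height, bpp)) should then yield a sane image.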

Upvotes: 3
