Reputation: 25
I'm decoding PDF files using Python with reference to the 2008 spec, particularly section 7.4.4.4: https://web.archive.org/web/20081203002256/https://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
Images are usually embedded in PDF as byte streams, and each stream is associated with a dictionary with information about the stream. For example, often the stream is a compressed form of the original data; such details are described by the Filter entry in the dictionary.
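For example, the dictionary for one of my image streams comes out of my parser as something like this (the values here are illustrative, not from a real document):

context_dict = {
    b'Type': b'XObject',
    b'Subtype': b'Image',
    b'Width': 800,
    b'Height': 563,
    b'ColorSpace': b'DeviceRGB',
    b'BitsPerComponent': 8,
    b'Filter': b'FlateDecode',
    # parameters describing the predictor preprocessing
    b'DecodeParms': {b'Predictor': 15, b'Colors': 3, b'Columns': 800},
}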
When I've got a stream whose filter is FlateDecode, this means the data were compressed using deflate, and this is easily reversed with zlib.decompress. But... to improve compression, the original data may be preprocessed by a predictor function, for example to difference adjacent bytes - when the data have a lot of similar values the result then compresses better. The preprocessing is identified by the Predictor entry in the dictionary.
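For example, reversing just the deflate step looks like this (stream_bytes and stream_dict are placeholder names for my parser's raw stream payload and its dictionary):

import zlib

def inflate_stream(stream_bytes, stream_dict):
    # Reverse the deflate step only; any Predictor preprocessing is still applied.
    raw = zlib.decompress(stream_bytes)
    parms = stream_dict.get(b'DecodeParms', {})
    predictor = parms.get(b'Predictor', 1)  # 1 means no prediction was used
    return raw, predictor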
The Predictor value 15 means to use a PNG differencing algorithm; unfortunately the 2008 PDF document basically just says "PNG prediction (on encoding, PNG optimum)". Yay.
Can someone explain to me (a) exactly which PNG filter algorithm this means (with a reference to its specification) and (b) ideally point me at a library which will reverse it? Lacking the latter I'd have to reverse it in pure Python, which will be slow - acceptably slow for my initial use case, and I guess I can rewrite it as a C extension (much) later if my needs become more frequent.
Where I am at present is:

- a bytes object, which is raw pixel data
- the Predictor value, 15 in my present example document

Currently my image property method looks like this:
@property
def image(self):
    im = self._image
    if im is None:
        decoded_bs = self.decoded_payload
        print(".image: context_dict:")
        print(decoded_bs[:10])
        pprint(self.context_dict)
        decode_params = self.context_dict.get(b'DecodeParms', {})
        color_transform = decode_params.get(b'ColorTransform', 0)
        color_space = self.context_dict[b'ColorSpace']
        bits_per_component = decode_params.get(b'BitsPerComponent')
        if not bits_per_component:
            # fall back to a default for the colour space
            bits_per_component = {b'DeviceRGB': 8, b'DeviceGray': 8}[color_space]
        colors = decode_params.get(b'Colors')
        if not colors:
            colors = {b'DeviceRGB': 3, b'DeviceGray': 1}[color_space]
        mode_index = (color_space, bits_per_component, colors, color_transform)
        width = self.context_dict[b'Width']
        height = self.context_dict[b'Height']
        print("mode_index =", mode_index)
        PIL_mode = {
            (b'DeviceGray', 1, 1, 0): 'L',
            (b'DeviceGray', 8, 1, 0): 'L',
            (b'DeviceRGB', 8, 3, 0): 'RGB',
        }[mode_index]
        print(
            "Image.frombytes(%r,(%d,%d),%r)..."
            % (PIL_mode, width, height, decoded_bs[:32])
        )
        im = Image.frombytes(PIL_mode, (width, height), decoded_bs)
        im.show()
        exit(1)
        self._image = im
    return im
This shows me an "edgy", skewed image because I'm decoding the difference data as colour data and treating the per-row filter-type bytes as pixel data, which skews each subsequent row slightly.
Upvotes: 0
Views: 340
Reputation: 227
The esteemed Mark Adler, who happens to be the first author of the PNG specification, already answered part (a) of the question about the specification of the PNG filter algorithm.
I will answer part (b): how to reverse the filtering efficiently. Python's Pillow library already implements PNG unfiltering, so the question is how to use it. My approach is to take the FlateDecode stream data as-is, wrap it in a minimal in-memory PNG file (signature, IHDR, IDAT and IEND chunks), and let Pillow decode and unfilter it.
Here is a demonstration that downloads a PDF and extracts three images from it.
import os, zlib, struct, urllib.request
from PIL import Image
from io import BytesIO


def decode_png_idat(idat_data, width, height, header=[8, 2, 0, 0, 0]):
    # header: bit depth, colour type, compression, filter, interlace
    # (defaults: 8-bit samples, colour type 2 = RGB, no interlacing)
    def write_chunk(chunk_type, chunk_data):
        f.write(struct.pack(">I", len(chunk_data)))
        f.write(chunk_type)
        f.write(chunk_data)
        f.write(struct.pack(">I", zlib.crc32(chunk_type + chunk_data)))

    # Wrap the raw stream bytes in a minimal PNG file in memory
    f = BytesIO()
    f.write(b"\x89PNG\r\n\x1a\n")
    write_chunk(b"IHDR", struct.pack(">IIBBBBB", width, height, *header))
    write_chunk(b"IDAT", idat_data)
    write_chunk(b"IEND", b"")
    f.seek(0)
    return Image.open(f)


def main():
    # backup URL in case the other URL is down
    # https://web.archive.org/web/20211023132308/https://sjtrny.com/files/10.1109_DICTA.2012.6411686.pdf
    url = "https://sjtrny.com/files/10.1109_DICTA.2012.6411686.pdf"
    filename = url.split("/")[-1]
    # Download PDF for testing
    if not os.path.exists(filename):
        print(f"Downloading {url} to {filename}")
        urllib.request.urlretrieve(url, filename)
    # Read PDF bytes
    with open(filename, "rb") as f:
        data = f.read()
    # Three example images with hardcoded offsets into the PDF
    example_image_data = [
        data[2828574:2828574+13271],
        data[5275869:5275869+211014],
        data[6939898:6939898+840881],
    ]
    width = 800
    height = 563
    # Decode and show images
    for image_data in example_image_data:
        image = decode_png_idat(image_data, width, height)
        image.show()


if __name__ == "__main__":
    main()
Upvotes: 1
Reputation: 112502
The predictor used for each row is given by the first byte in each row, if the "Predictor" parameter is 10 or more. In that case, the value of that parameter has no further meaning. It doesn't matter that it's 15, other than the fact that 15 >= 10.
You can find the filter types in the PNG specification, section 9 (Filtering): https://www.w3.org/TR/PNG/#9Filters
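For reference, a slow but straightforward pure-Python sketch of undoing those per-row filters, assuming 8-bit samples (bpp is bytes per pixel, e.g. 3 for RGB); data is the zlib-decompressed stream, where each row is one filter-type byte followed by width*bpp filtered bytes:

def unfilter_png_rows(data, width, height, bpp=3):
    def paeth(a, b, c):
        # Paeth predictor as defined in the PNG specification
        p = a + b - c
        pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
        if pa <= pb and pa <= pc:
            return a
        return b if pb <= pc else c

    stride = width * bpp
    prior = bytearray(stride)          # the row above the first row is all zeros
    out = bytearray()
    for r in range(height):
        offset = r * (stride + 1)
        ftype = data[offset]           # per-row filter-type byte: 0..4
        row = bytearray(data[offset + 1:offset + 1 + stride])
        for i in range(stride):
            left = row[i - bpp] if i >= bpp else 0
            up = prior[i]
            up_left = prior[i - bpp] if i >= bpp else 0
            if ftype == 1:             # Sub
                row[i] = (row[i] + left) & 0xFF
            elif ftype == 2:           # Up
                row[i] = (row[i] + up) & 0xFF
            elif ftype == 3:           # Average
                row[i] = (row[i] + (left + up) // 2) & 0xFF
            elif ftype == 4:           # Paeth
                row[i] = (row[i] + paeth(left, up, up_left)) & 0xFF
            # ftype 0 (None) leaves the byte unchanged
        out += row
        prior = row
    return bytes(out)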
Upvotes: 3