John Van de Pol

Reputation: 5

How to decode a PdfImageObject with filter "[/FlateDecode, /RunLengthDecode]"

I have been successfully extracting images from PDFs for a few years, using iTextSharp. I get a PdfImageObject and read its filter. Usually this filter is "/FlateDecode". In that case, I use pdf.PdfReader.FlateDecode(bytes, True) to decode the raw bytes.

But recently I have run into PDFs containing PdfImageObjects with the filter "[/FlateDecode, /RunLengthDecode]".

So I guess the raw bytes must be decoded twice?

I found some code on the internet for the /RunLengthDecode part: https://github.com/kusl/itextsharp/blob/master/tags/iTextSharp_5_4_5/src/core/iTextSharp/text/pdf/FilterHandlers.cs

I tried applying both decoders to the image in either order: first /FlateDecode and then /RunLengthDecode, and also first /RunLengthDecode and then /FlateDecode.

But in both scenarios the /RunLengthDecode code throws an error.

Upvotes: 0

Views: 3035

Answers (1)

mkl

Reputation: 95918

This actually is not an answer to the question as is but an analysis of the problem that led to this question.

Comments on the question revealed that a bug in iText is the reason the OP is trying to filter raw streams and extract images manually: certain images were extracted with small errors. The OP identified the problematic images as those with the filters [/FlateDecode, /RunLengthDecode].
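On the decoding order the question asks about: the PDF specification lists the entries of a /Filter array in the order in which they are applied when decoding, so [/FlateDecode, /RunLengthDecode] means inflating first and run-length decoding the result second. A minimal Java sketch of that chaining follows. `FilterChain` and its methods are hypothetical names, not iText API, and the sketch assumes no predictor is set in /DecodeParms.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of iText: applies stream decoders in array order.
final class FilterChain {

    // Inflates zlib-wrapped deflate data, the format /FlateDecode streams use.
    // Assumption: no predictor from /DecodeParms needs to be undone afterwards.
    static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input: return what was decoded so far
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Feeds the output of each decoder into the next one, first list entry first.
    static byte[] decode(byte[] raw, List<UnaryOperator<byte[]>> decoders) {
        byte[] current = raw;
        for (UnaryOperator<byte[]> decoder : decoders) {
            current = decoder.apply(current);
        }
        return current;
    }
}
```

For the filter array in question, the list would hold `FilterChain::inflate` first and a run-length decoder second.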

The bug

The bug is indeed in iText's implementation of the RunLengthDecode filter, shown here from iText for .NET 5.5.x:

private class Filter_RUNLENGTHDECODE : IFilterHandler {

    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) {
     // allocate the output buffer
        MemoryStream baos = new MemoryStream();
        sbyte dupCount = -1;
        for (int i = 0; i < b.Length; i++){
            dupCount = (sbyte)b[i];
            if (dupCount == -128) break; // this is implicit end of data

            if (dupCount >= 0 && dupCount <= 127){
                int bytesToCopy = dupCount+1;
                baos.Write(b, i, bytesToCopy);
                i+=bytesToCopy;
            } else {
                // make dupcount copies of the next byte
                i++;
                for (int j = 0; j < 1-(int)(dupCount);j++){ 
                    baos.WriteByte(b[i]);
                }
            }
        }
        return baos.ToArray();
    }
}

More precisely, it is this line:

                baos.Write(b, i, bytesToCopy);

It should copy the bytesToCopy bytes that follow index i, since index i holds the count byte itself, but this call copies bytesToCopy bytes starting at index i. Thus, for every literal run (a run of bytes to be copied once), iText first copies the count byte and then all but the final byte of the run.

Instead the line should be

                baos.Write(b, i+1, bytesToCopy);
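For illustration, here is a standalone Java sketch of the decoder with that one-line change applied. `RunLengthDecoder` is a hypothetical name, not an iText class, and the bounds checks for truncated input are an addition of this sketch that the quoted code does not have.

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch, not the iText class: same loop as the quoted code,
// but the literal-run copy starts after the length byte.
final class RunLengthDecoder {

    static byte[] decode(byte[] b) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (int i = 0; i < b.length; i++) {
            byte dupCount = b[i];
            if (dupCount == (byte) 0x80) {
                break; // length byte 128 marks end of data
            }
            if (dupCount >= 0) {
                // literal run: copy the dupCount + 1 bytes that follow the length byte,
                // clamped to the remaining input (assumption: truncated data is tolerated)
                int bytesToCopy = Math.min(dupCount + 1, b.length - i - 1);
                baos.write(b, i + 1, bytesToCopy);
                i += bytesToCopy; // the loop's i++ then lands on the next length byte
            } else if (i + 1 < b.length) {
                // repeat run: write the next byte 1 - dupCount times (2 to 128 copies)
                i++;
                for (int j = 0; j < 1 - dupCount; j++) {
                    baos.write(b[i]);
                }
            }
        }
        return baos.toByteArray();
    }
}
```

With the input bytes 2, 'A', 'B', 'C', 0xFE, 'X', 0x80 this returns the six bytes "ABCXXX": a literal run of three bytes followed by three copies of 'X'.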

Example effect on bitmap images

Runs of duplicate bytes are extracted correctly, and even long non-duplicate runs yield mostly correct bytes (shifted by one position), so the images iText extracts look only slightly wrong, e.g.:

Damaged image:

[image: damaged image]

Undamaged image:

[image: undamaged image]
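The off-by-one shift can be made visible on a single literal run. The following Java sketch isolates just the copy step. `LiteralRunDemo` and its `startOffset` parameter are hypothetical and exist only for this illustration: offset 0 mimics the behaviour of the iText code, offset 1 skips the length byte.

```java
import java.io.ByteArrayOutputStream;

// Illustration only: isolates the literal-run copy to show the one-byte shift.
final class LiteralRunDemo {

    // b[0] is the length byte n, followed by n + 1 literal bytes.
    // startOffset = 0 copies from the length byte itself (the buggy behaviour),
    // startOffset = 1 copies the bytes after it.
    static byte[] copyLiteralRun(byte[] b, int startOffset) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        int bytesToCopy = b[0] + 1;
        baos.write(b, startOffset, bytesToCopy);
        return baos.toByteArray();
    }
}
```

For the run 3, 10, 20, 30, 40, offset 0 yields 3, 10, 20, 30 while offset 1 yields 10, 20, 30, 40. Three of the four pixel bytes survive, each one position late, which is consistent with images that look only slightly damaged.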

Pervasiveness of the bug

This bug has been in iText 5.x for .NET for many years. It has also been present in iText 5.x for Java for many years and still is, e.g. here from the current 5.5.13-SNAPSHOT:

private static class Filter_RUNLENGTHDECODE implements FilterHandler{

    public byte[] decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) throws IOException {
     // allocate the output buffer
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte dupCount = -1;
        for(int i = 0; i < b.length; i++){
            dupCount = b[i];
            if (dupCount == -128) break; // this is implicit end of data

            if (dupCount >= 0 && dupCount <= 127){
                int bytesToCopy = dupCount+1;
                baos.write(b, i, bytesToCopy);
                i+=bytesToCopy;
            } else {
                // make dupcount copies of the next byte
                i++;
                for(int j = 0; j < 1-(int)(dupCount);j++){ 
                    baos.write(b[i]);
                }
            }
        }

        return baos.toByteArray();
    }
}

and in iText 7, e.g. here from the current 7.1.2-SNAPSHOT for Java:

public class RunLengthDecodeFilter implements IFilterHandler {

    @Override
    public byte[] decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte dupCount;
        for (int i = 0; i < b.length; i++) {
            dupCount = b[i];
            if (dupCount == (byte) 0x80) { // this is implicit end of data
                break;
            }
            if (dupCount >= 0) {
                int bytesToCopy = dupCount + 1;
                baos.write(b, i, bytesToCopy);
                i += bytesToCopy;
            } else {                // make dupcount copies of the next byte
                i++;
                for (int j = 0; j < 1 - (int) (dupCount); j++) {
                    baos.write(b[i]);
                }
            }
        }
        return baos.toByteArray();
    }
}

Most likely the bug could survive this long because the RunLengthDecode filter has hardly been used for a number of years.

Upvotes: 3
