Joshua Rogers

Reputation: 3558

Error while trying to decompress stream in PDF

I'm trying to decompress a stream from a PDF Object in this file:

 4 0 obj
<< 
/Filter /FlateDecode
/Length 64
>>
stream
xœs
QÐw34V02UIS0´0P030PIQÐpÉÏKIUH-.ITH.-*Ê··×TÉRp
á T‰
Ê
endstream
endobj

I have copy-pasted this stream, in the same format as in the original file, into a file called Stream.file:

xœs
QÐw34V02UIS0´0P030PIQÐpÉÏKIUH-.ITH.-*Ê··×TÉRp
á T‰
Ê

This stream should decode to: Donde esta curro??. I added that stream as Stream.file to a C# console application:

using System.IO;
using System.IO.Compression;

namespace Filters
{
    public static class FiltersLoader
    {
        public static void Parse()
        {
            var bytes = File.ReadAllBytes("Stream.file");
            var originalFileStream = new MemoryStream(bytes);

            using (var decompressedFileStream = new MemoryStream())
            using (var decompressionStream = new DeflateStream(originalFileStream, CompressionMode.Decompress))
            {
                decompressionStream.CopyTo(decompressedFileStream);
            }    
        }
    }
}

However, it throws an exception while trying to copy it:

The archive entry was compressed using an unsupported compression method.

I'd like to know how to decode this stream with .NET code, if that's possible.

Thanks.

Upvotes: 0

Views: 3960

Answers (2)

mkl

Reputation: 96009

The main problem is that the DeflateStream class can decode a naked FLATE compressed stream (as per RFC 1951), but the content of PDF streams with the FlateDecode filter is actually presented in the ZLIB Compressed Data Format (as per RFC 1950), which wraps the FLATE compressed data.

To fix this, it suffices to drop the two-byte ZLIB header.

Another problem became clear in your first example document: that document was encrypted, so the stream contents therein have to be decrypted before FLATE decoding.

Drop ZLIB header to get to the FLATE encoded data

The DeflateStream class can decode a naked FLATE compressed stream (as per RFC 1951), but the content of PDF streams with the FlateDecode filter is actually presented in the ZLIB Compressed Data Format (as per RFC 1950), which wraps the FLATE compressed data.

Fortunately, it is pretty easy to jump to the FLATE encoded data therein: one simply has to drop the first two bytes. (Strictly speaking, there might be a dictionary identifier between them and the FLATE encoded data, but this appears to be seldom used.)

In the case of your code:

var bytes = File.ReadAllBytes("Stream.file");
var originalFileStream = new MemoryStream(bytes);

// Skip the two-byte ZLIB header so that DeflateStream starts at the raw FLATE data.
originalFileStream.ReadByte();
originalFileStream.ReadByte();

using (var decompressedFileStream = new MemoryStream())
using (var decompressionStream = new DeflateStream(originalFileStream, CompressionMode.Decompress))
{
    decompressionStream.CopyTo(decompressedFileStream);
}
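A slightly more defensive variant might check the two header bytes before skipping them, instead of dropping them blindly. The following is only a sketch: ZlibInflater and Inflate are hypothetical names, not part of any library, and the header rules are those of RFC 1950 (compression method 8 in the low nibble of the first byte, both bytes together a multiple of 31, bit 5 of the second byte flagging a preset dictionary). The Adler-32 checksum at the end of the ZLIB data is not verified here.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of any library.
public static class ZlibInflater
{
    public static byte[] Inflate(byte[] zlibData)
    {
        if (zlibData == null || zlibData.Length < 2)
            throw new InvalidDataException("Too short to contain a ZLIB header.");

        int cmf = zlibData[0];
        int flg = zlibData[1];

        // The low nibble of CMF is the compression method; 8 means DEFLATE.
        if ((cmf & 0x0F) != 8)
            throw new InvalidDataException("Compression method is not DEFLATE.");

        // CMF and FLG, read as one big-endian 16-bit value, must be a multiple of 31.
        if (((cmf << 8) | flg) % 31 != 0)
            throw new InvalidDataException("ZLIB header check failed.");

        // Bit 5 of FLG (FDICT) announces a preset dictionary, which DeflateStream
        // cannot be given, so this sketch refuses such data.
        if ((flg & 0x20) != 0)
            throw new NotSupportedException("Preset dictionaries are not handled.");

        // Hand everything after the two header bytes to DeflateStream.
        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

With the bytes from Stream.file this would be called as ZlibInflater.Inflate(File.ReadAllBytes("Stream.file")). On .NET 6 or later, the built-in ZLibStream class in System.IO.Compression reads the ZLIB wrapper directly and may make the manual skip unnecessary.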

In case of encrypted PDFs, decrypt first

Your first example file pdf-test.pdf is encrypted, as indicated by the presence of an Encrypt entry in the trailer:

trailer
<</Size 37/Encrypt 38 0 R>>
startxref
116
%%EOF

Before decompressing stream contents, therefore, you have to decrypt them.
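As a rough illustration of that check, the sketch below looks for an /Encrypt key after the last trailer keyword. It is a heuristic only, not a PDF parser: PdfEncryptionCheck and LooksEncrypted are hypothetical names, and files that keep their trailer information in a cross-reference stream have no trailer keyword, in which case the whole file is searched instead. It only detects encryption; the decryption itself is not shown.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class PdfEncryptionCheck
{
    public static bool LooksEncrypted(byte[] pdf)
    {
        // ISO-8859-1 maps every byte to exactly one character, so binary
        // stream data cannot break the decoding or shift any positions.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);

        // Start at the last "trailer" keyword if there is one, else at the beginning.
        int from = text.LastIndexOf("trailer", StringComparison.Ordinal);
        if (from < 0)
            from = 0;

        return text.IndexOf("/Encrypt", from, StringComparison.Ordinal) >= 0;
    }
}
```

For the trailer quoted above, LooksEncrypted would report true, which means the stream bytes have to be decrypted before they are passed to any FLATE decoding code.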

Upvotes: 4

john vine

Reputation: 11

I have learned from this thread.

While writing a PDF decoder using the System.IO.Compression classes, I have been stuck at the PDF /FlateDecode stream decoding stage for the past week, always getting 'The archive entry was compressed using an unsupported compression method.'

Make sure you get the correct number of bytes from the compressed stream.

Ignore CR and LF after the word 'stream'.

Ignore CR and LF before the word 'endstream'.
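The three points above could be sketched roughly as follows, working on the raw file bytes rather than on copy-pasted text. PdfStreamExtractor and ExtractStreamBytes are hypothetical names, and the code assumes that searchFrom points at the start of the object whose stream is wanted. Where the /Length entry is a direct number, as in the object from the question, taking exactly that many bytes after the stream keyword is likely the more reliable approach, because compressed data can itself end in a CR or LF byte.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of any library.
public static class PdfStreamExtractor
{
    // Returns the bytes between the first "stream" keyword at or after
    // searchFrom and the following "endstream" keyword.
    public static byte[] ExtractStreamBytes(byte[] pdf, int searchFrom)
    {
        byte[] streamKeyword = Encoding.ASCII.GetBytes("stream");
        byte[] endKeyword = Encoding.ASCII.GetBytes("endstream");

        int start = IndexOf(pdf, streamKeyword, searchFrom);
        if (start < 0)
            throw new InvalidOperationException("No stream keyword found.");
        start += streamKeyword.Length;

        // Ignore the CR and LF after the word 'stream'.
        if (start < pdf.Length && pdf[start] == (byte)'\r') start++;
        if (start < pdf.Length && pdf[start] == (byte)'\n') start++;

        int end = IndexOf(pdf, endKeyword, start);
        if (end < 0)
            throw new InvalidOperationException("No endstream keyword found.");

        // Ignore the CR and LF before the word 'endstream'.
        if (end > start && pdf[end - 1] == (byte)'\n') end--;
        if (end > start && pdf[end - 1] == (byte)'\r') end--;

        var result = new byte[end - start];
        Array.Copy(pdf, start, result, 0, result.Length);
        return result;
    }

    // Plain byte-wise search; returns -1 if the needle does not occur.
    private static int IndexOf(byte[] haystack, byte[] needle, int from)
    {
        for (int i = from; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

The returned array still starts with the two ZLIB header bytes, so they have to be consumed before the data is handed to DeflateStream.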

Consuming the 2 header bytes, using mkl's 'originalFileStream.ReadByte()' method, was the answer to my final problem!

Thanks very much mkl, for the final piece to my puzzle.

Upvotes: 1
