Brad

Reputation: 301

FlateDecode PDF Decoding

I have output from iTextSharp that a PDF reader can parse, but I want to take the binary content and parse it manually. I tried taking the text between the tags <</Length 256/Filter/FlateDecode>>stream and endstream and decompressing it with the .NET DeflateStream class, which resulted in this exception:

System.IO.InvalidDataException: Block length does not match with its complement.
   at System.IO.Compression.Inflater.DecodeUncompressedBlock(Boolean& end_of_block)
   at System.IO.Compression.Inflater.Decode()
   at System.IO.Compression.Inflater.Inflate(Byte[] bytes, Int32 offset, Int32 length)
   at System.IO.Compression.DeflateStream.Read(Byte[] array, Int32 offset, Int32 count)
   at System.IO.Stream.InternalCopyTo(Stream destination, Int32 bufferSize)
   at FlateDecodeTest.Decompress(Byte[] data)

My code is:

using System;
using System.Security.Cryptography;
using System.Text;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

public class FlateDecodeTest
{
    public static void Main() 
    {
        string s = @"xœuÁN!E÷|Å...";

        byte[] b = Decompress(GetBytes(s));

        Console.WriteLine(GetString(b));
    }

    public static byte[] Decompress(byte[] data)
    {
        Console.WriteLine(data.Length);
        byte[] decompressedArray = null;
        try
        {
            using (MemoryStream decompressedStream = new MemoryStream())
            {
                using (MemoryStream compressStream = new MemoryStream(data))
                {
                    using (DeflateStream deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
                    {
                        deflateStream.CopyTo(decompressedStream);
                    }
                }
                decompressedArray = decompressedStream.ToArray();
            }
        }
        catch (Exception exception)
        {
            Console.WriteLine(exception);
        }

        return decompressedArray;
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}

Upvotes: 1

Views: 8009

Answers (1)

Bruno Lowagie

Reputation: 77528

Do not use the DeflateStream class. If you are interested in the content stream of a page (let's say of page 1), you can use this method:

byte[] streamBytes = reader.GetPageContent(1);

Here, reader is an instance of the PdfReader class. Of course, this isn't sufficient if the page has Form XObjects in its resources dictionary. In that case, you'll have to use the PRStream object. For instance, if a Form XObject (or any other stream object) has object number 23, then you can get the PRStream object like this:

PRStream str = (PRStream)reader.GetPdfObject(23);
byte[] bytes = PdfReader.GetStreamBytes(str);

As opposed to the GetStreamBytesRaw() method, which gives you the raw, compressed bytes, the GetStreamBytes() method will decompress the stream. See also the question "iTextSharp: Convert PdfObject to PdfStream".

If you don't know the number of the object you want to examine, you can walk through the PDF object tree and, for instance, use the GetAsStream() method of a PdfDictionary, a PdfArray, and so on.
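
If you still want to inflate the raw bytes by hand, it may help to know why DeflateStream rejects them. As far as I can tell, FlateDecode data is wrapped in the zlib format (RFC 1950): a 2-byte header (the "xœ" in your string is 0x78 0x9C), then the raw deflate data (RFC 1951), then a 4-byte Adler-32 checksum. DeflateStream expects raw deflate only, so it misreads the header byte as the start of an uncompressed block. Pasting the bytes into a string literal and converting with GetBytes() also changes them, so work on the bytes read from the file instead. Below is a minimal sketch, not iTextSharp code: InflateZlib is a hypothetical helper, the sample bytes are simply the zlib encoding of the text "hello", and I assume the stream has no preset dictionary.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class ZlibInflateSketch
{
    // Hypothetical helper (not part of iTextSharp): inflates zlib-wrapped data
    // by skipping the 2-byte zlib header and handing the rest to DeflateStream.
    // Assumes no preset dictionary; the Adler-32 trailer is not verified.
    public static byte[] InflateZlib(byte[] data)
    {
        if (data == null || data.Length < 2)
            throw new ArgumentException("Too short to hold a zlib header.", nameof(data));

        // Header check from RFC 1950: the low nibble of the first byte is 8
        // for deflate, and the two header bytes read as a big-endian 16-bit
        // number must be a multiple of 31.
        if ((data[0] & 0x0F) != 8 || ((data[0] << 8) | data[1]) % 31 != 0)
            throw new InvalidDataException("Not a zlib stream.");

        using (var input = new MemoryStream(data, 2, data.Length - 2))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            // DeflateStream stops at the end of the final deflate block,
            // so the trailing checksum bytes are simply left unread.
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }

    public static void Main()
    {
        // zlib encoding of the ASCII text "hello" (header 78 9C, checksum 06 2C 02 15).
        byte[] sample =
        {
            0x78, 0x9C, 0xCB, 0x48, 0xCD, 0xC9, 0xC9, 0x07, 0x00,
            0x06, 0x2C, 0x02, 0x15
        };

        Console.WriteLine(Encoding.ASCII.GetString(InflateZlib(sample))); // prints "hello"
    }
}
```

On .NET 6 or later, the built-in ZLibStream class reads the header and trailer for you, so the manual skip is only needed on older frameworks. Either way, this only handles streams whose sole filter is FlateDecode; GetStreamBytes() remains the safer route because it also deals with other filters and predictors.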

Upvotes: 1
