greentea

Reputation: 357

Is there an easy way to manually decode a FlateDecode Filter to extract text in a PDF? C#

I posted a question related to this a while back but got no responses. Since then, I've discovered that the PDF is encoded using FlateDecode, and I was wondering if there is a way to manually decode the PDF in C# (Windows Phone 8)? I'm getting output like the following:

%PDF-1.5
%????
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
5 0 obj
<<
/Filter /FlateDecode
/Length 9
>>
stream x^+

The PDF has been created using the SyncFusion PDF controls for Windows Phone 8. Unfortunately, they do not currently have a text extraction feature, and I couldn't find that feature in other WP PDF controls either.

Basically, all I want is to download the PDF from OneDrive and read its contents. Is this easily doable?

Upvotes: 7

Views: 22916

Answers (2)

Pete

Reputation: 1271

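using System;
using System.IO;
using System.IO.Compression;
using System.Text;

private static string decompress(byte[] input)
{
    // Skip the 2-byte zlib header; DeflateStream expects raw deflate data.
    byte[] cutinput = new byte[input.Length - 2];
    Array.Copy(input, 2, cutinput, 0, cutinput.Length);

    using (var stream = new MemoryStream())
    using (var compressStream = new MemoryStream(cutinput))
    using (var decompressor = new DeflateStream(compressStream, CompressionMode.Decompress))
    {
        // Inflate the data into the in-memory output stream.
        decompressor.CopyTo(stream);

        // Interpret the decompressed bytes using the platform's default encoding.
        return Encoding.Default.GetString(stream.ToArray());
    }
}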

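According to the similar question linked below, the first 2 bytes have to be cut from the stream data, which is what the function above does. Those two bytes are the zlib header, which DeflateStream does not expect because it reads raw deflate data. Pass all the bytes of the stream as input, and make sure the byte count matches the /Length value specified in the stream dictionary.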

C# decode (decompress) Deflate data of PDF File
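
To get the bytes to pass in, the stream data first has to be cut out of the PDF file. The following is only a rough sketch of that step, not a full PDF parser: the class and method names (PdfStreamFinder, ExtractFirstStream, IndexOf) are hypothetical helpers made up for illustration. It assumes the file is already loaded into a byte array, and it searches for the "endstream" keyword instead of parsing /Length, so it may include a trailing end-of-line byte or stop early if the compressed data happens to contain that keyword.

```csharp
using System;
using System.Text;

static class PdfStreamFinder
{
    // Returns the raw bytes between the first "stream" keyword and the
    // following "endstream" keyword, or null if no stream is found.
    public static byte[] ExtractFirstStream(byte[] pdf)
    {
        byte[] startKey = Encoding.ASCII.GetBytes("stream");
        byte[] endKey = Encoding.ASCII.GetBytes("endstream");

        int start = IndexOf(pdf, startKey, 0);
        if (start < 0) return null;
        start += startKey.Length;

        // The "stream" keyword is followed by CRLF or a single LF
        // before the data begins.
        if (start < pdf.Length && pdf[start] == (byte)'\r') start++;
        if (start < pdf.Length && pdf[start] == (byte)'\n') start++;

        int end = IndexOf(pdf, endKey, start);
        if (end < 0) return null;

        byte[] data = new byte[end - start];
        Array.Copy(pdf, start, data, 0, data.Length);
        return data;
    }

    // Naive byte-pattern search; returns -1 when the pattern is absent.
    private static int IndexOf(byte[] haystack, byte[] needle, int from)
    {
        for (int i = from; i <= haystack.Length - needle.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) return i;
        }
        return -1;
    }
}
```

The returned array is what would be handed to a decompress function. A more robust version would read the /Length entry from the stream dictionary and copy exactly that many bytes.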

Upvotes: 6

Gotcha

Reputation: 392

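The easiest solution is to use the DeflateStream class provided by the .NET Framework. An example can be found in a similar thread. This approach might have some pitfalls, for example the 2-byte zlib header that precedes the deflate data and that DeflateStream does not skip on its own.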
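
One way to guard against the header pitfall is to check whether the data really starts with a zlib header before skipping anything. The sketch below is illustrative only: ZlibHeader and LooksLikeZlib are hypothetical names, and the check follows the header layout in RFC 1950 (compression method 8 in the low nibble of the first byte, and the two bytes read as a 16-bit big-endian number being a multiple of 31).

```csharp
static class ZlibHeader
{
    // Returns true when the first two bytes form a valid zlib header
    // for deflate-compressed data.
    public static bool LooksLikeZlib(byte[] data)
    {
        if (data == null || data.Length < 2) return false;

        int cmf = data[0]; // compression method and flags
        int flg = data[1]; // flags, including the check bits

        bool isDeflate = (cmf & 0x0F) == 8;          // CM = 8 means deflate
        bool checkOk = ((cmf << 8) | flg) % 31 == 0; // FCHECK rule
        return isDeflate && checkOk;
    }
}
```

The "x^" visible after the stream keyword in the question's dump corresponds to the bytes 0x78 0x5E, which pass this check. If the check succeeds, skip two bytes before handing the rest to DeflateStream; if it fails, the data may already be raw deflate or use a different filter.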

If this doesn't work, there are libraries (like DotNetZip) capable of decompressing deflate streams. Please check this link for a performance comparison.

The last option I see, without reinventing the wheel, is to use another PDF parsing library for the stream decompression, or even for the whole PDF processing.

Upvotes: 1
