Reputation: 301
I have a PDF produced by iTextSharp that a PDF reader can parse, but I want to take the binary content and parse it manually. I took the data between <</Length 256/Filter/FlateDecode>>stream and endstream and tried to decompress it with the .NET DeflateStream class, which threw this exception:
System.IO.InvalidDataException: Block length does not match with its complement.
   at System.IO.Compression.Inflater.DecodeUncompressedBlock(Boolean& end_of_block)
   at System.IO.Compression.Inflater.Decode()
   at System.IO.Compression.Inflater.Inflate(Byte[] bytes, Int32 offset, Int32 length)
   at System.IO.Compression.DeflateStream.Read(Byte[] array, Int32 offset, Int32 count)
   at System.IO.Stream.InternalCopyTo(Stream destination, Int32 bufferSize)
   at FlateDecodeTest.Decompress(Byte[] data)
My code is:
using System;
using System.Security.Cryptography;
using System.Text;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

public class FlateDecodeTest
{
    public static void Main()
    {
        string s = @"xœuÁN!E÷|Å...";
        byte[] b = Decompress(GetBytes(s));
        Console.WriteLine(GetString(b));
    }

    public static byte[] Decompress(byte[] data)
    {
        Console.WriteLine(data.Length);
        byte[] decompressedArray = null;
        try
        {
            using (MemoryStream decompressedStream = new MemoryStream())
            {
                using (MemoryStream compressStream = new MemoryStream(data))
                {
                    using (DeflateStream deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
                    {
                        deflateStream.CopyTo(decompressedStream);
                    }
                }
                decompressedArray = decompressedStream.ToArray();
            }
        }
        catch (Exception exception)
        {
            Console.WriteLine(exception);
        }
        return decompressedArray;
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}
Upvotes: 1
Views: 8009
Reputation: 77528
Do not use the DeflateStream class. If you are interested in the content stream of a page (say, page 1), you can use this method:
byte[] streamBytes = reader.GetPageContent(1);
Here, reader is an instance of the PdfReader class. Of course, this isn't sufficient if the page has Form XObjects in its resources dictionary. In that case, you'll have to use the PRStream object. For instance, if a Form XObject (or any other stream object) has object number 23, then you get the PRStream object like this:
PRStream str = (PRStream)reader.GetPdfObject(23);
byte[] bytes = PdfReader.GetStreamBytes(str);
As opposed to the GetStreamBytesRaw() method, which gives you the raw, compressed bytes, the GetStreamBytes() method will decompress the stream. See also the question "iTextSharp: Convert PdfObject to PdfStream".
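For context on the exception in the question: FlateDecode data is zlib-wrapped (RFC 1950), so it starts with a 2-byte header (typically 0x78 0x9C, which shows up as "xœ" in a text view) and ends with a 4-byte Adler-32 checksum, whereas DeflateStream expects raw deflate data (RFC 1951). Read as raw deflate, the leading 0x78 byte announces a stored block, which is where "Block length does not match with its complement" comes from. Pasting the bytes into a string literal also alters them. The following is only a minimal sketch of the format, using nothing but System.IO.Compression; the names FlateSketch, InflateZlib and DeflateZlib are made up for illustration, and it ignores predictors and every filter other than Flate, which is why GetStreamBytes() remains the simpler route.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FlateSketch
{
    // Inflate zlib-wrapped data by skipping the 2-byte zlib header.
    // The 4-byte Adler-32 trailer is left unread and is NOT verified here.
    public static byte[] InflateZlib(byte[] zlibData)
    {
        using (var input = new MemoryStream(zlibData, 2, zlibData.Length - 2))
        using (var inflater = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            inflater.CopyTo(output);
            return output.ToArray();
        }
    }

    // Demo helper (hypothetical): wrap raw deflate output in a zlib header
    // and Adler-32 trailer, so the sketch has realistic input to work on.
    public static byte[] DeflateZlib(byte[] plain)
    {
        using (var output = new MemoryStream())
        {
            output.WriteByte(0x78); // CMF: deflate, 32 KB window
            output.WriteByte(0x9C); // FLG: default compression level
            using (var deflater = new DeflateStream(output, CompressionMode.Compress, true))
            {
                deflater.Write(plain, 0, plain.Length);
            }

            // Adler-32 of the uncompressed data, written big-endian.
            uint a = 1, b = 0;
            foreach (byte x in plain)
            {
                a = (a + x) % 65521;
                b = (b + a) % 65521;
            }
            uint adler = (b << 16) | a;
            output.WriteByte((byte)(adler >> 24));
            output.WriteByte((byte)(adler >> 16));
            output.WriteByte((byte)(adler >> 8));
            output.WriteByte((byte)adler);
            return output.ToArray();
        }
    }

    public static void Main()
    {
        // Round trip on a small, made-up content stream fragment.
        byte[] plain = Encoding.ASCII.GetBytes("BT /F1 12 Tf (Hello) Tj ET");
        byte[] wrapped = DeflateZlib(plain);
        Console.WriteLine(Encoding.ASCII.GetString(InflateZlib(wrapped)));
    }
}
```

The input has to be the actual bytes read from the file (for example via File.ReadAllBytes or GetStreamBytesRaw()), never text copied out of an editor.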
If you don't know the number of the object you want to examine, you can walk through the PDF object tree and, for instance, use the GetAsStream() method of a PdfDictionary, a PdfArray, and so on.
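As a rough sketch of finding streams without knowing their object numbers: instead of following dictionaries with GetAsStream(), the code below simply scans every object number in the cross-reference table and decodes whatever turns out to be a stream. It assumes iTextSharp 5.x; the class name StreamDump, the method name DecodeAllStreams and the file name input.pdf are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class StreamDump
{
    // Returns the decoded bytes of every stream object, keyed by object number.
    public static Dictionary<int, byte[]> DecodeAllStreams(PdfReader reader)
    {
        var result = new Dictionary<int, byte[]>();
        for (int i = 1; i < reader.XrefSize; i++)
        {
            PdfObject obj = reader.GetPdfObject(i);
            if (obj == null || !obj.IsStream())
                continue; // free entry or not a stream

            try
            {
                result[i] = PdfReader.GetStreamBytes((PRStream)obj);
            }
            catch (Exception)
            {
                // Some filters (image codecs, for example) may not be
                // decodable by GetStreamBytes(); this sketch skips them.
            }
        }
        return result;
    }

    public static void Main()
    {
        PdfReader reader = new PdfReader("input.pdf"); // hypothetical path
        foreach (KeyValuePair<int, byte[]> entry in DecodeAllStreams(reader))
        {
            Console.WriteLine("Object {0}: {1} decoded bytes", entry.Key, entry.Value.Length);
        }
        reader.Close();
    }
}
```

A scan like this also picks up streams that no page references, so walking from the page dictionaries is the better fit when you only care about what a given page actually uses.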
Upvotes: 1