Reputation: 101

Issues Decoding Flate from PDF Embedded Font

Ok, before we start. I work for a company that has a license to redistribute PDF files from various publishers in any media form. So, that being said, the extraction of embedded fonts from the given PDF files is not only legal - but also vital to the presentation.

I am using code found on this site, however I do not recall the author, when I find it I will reference them. I have located the stream within the PDF file that contains the embedded fonts, I have isolated this encoded stream as a string and then into a byte[]. When I use the following code I get an error

Block length does not match with its complement.

Code (the error occurs in the while line below):

private static byte[] DecodeFlateDecodeData(byte[] data)
{
    MemoryStream outputStream;
    using (outputStream = new MemoryStream())
    {
        using (var compressedDataStream = new MemoryStream(data))
        {
            // Remove the first two bytes to skip the header (it isn't recognized by the DeflateStream class)
            compressedDataStream.ReadByte();
            compressedDataStream.ReadByte();

            var deflateStream = new DeflateStream(compressedDataStream, CompressionMode.Decompress, true);
            var decompressedBuffer = new byte[compressedDataStream.Length];
            int read;

            // The error occurs in the following line
            while ((read = deflateStream.Read(decompressedBuffer, 0, decompressedBuffer.Length)) != 0)
            {
                outputStream.Write(decompressedBuffer, 0, read);
            }
            outputStream.Flush();
            compressedDataStream.Close();
        }

        return ReadFully(outputStream);
    }
}

After using the usual tools (Google, Bing, archives here) I found that the majority of the time that this occurs is when one has not consumed the first two bytes of the encoding stream - but this is done here so i cannot find the source of this error. Below is the encoded stream:

H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlÇ±“ºu“°tƒ¦t0ÊD¶jˆ
Ö   m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ýÝ‡Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü    ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

Please help, I am beating my head against the wall here!

NOTE: The stream above is the encoded version of Arial Black - according to the specs inside the PDF:

661 0 obj
<< 
/Type /FontDescriptor 
/FontFile3 662 0 R 
/FontBBox [ -194 -307 1688 1083 ] 
/FontName /HLJOBA+ArialBlack 
/Flags 4 
/StemV 0 
/CapHeight 715 
/XHeight 518 
/Ascent 0 
/Descent -209 
/ItalicAngle 0 
/CharSet (/space/T/e/s/t/a/k/i/n/g/S/r/E/x/m/O/u/l)
>> 
endobj
662 0 obj
<< /Length 1700 /Filter /FlateDecode /Subtype /Type1C >> 
stream
H‰LT}lg?7ñù¤aŽÂ½ãnÕ´jh›Ú?-T’ÑRL–¦
ëš:Uí6Ÿ¶“ø+ñ÷ùü™”ÒÆŸŸíóWlÇ±“ºu“°tƒ¦t0ÊD¶jˆ
Ö   m:$½×^*qABBï?Þç÷|ýÞßóJÖˆD"yâP—òpgÇó¦Q¾S¯9£Û¾mçÁçÚ„cÂÛO¡É‡·¥ï~á³ÇãO¡ŸØö=öPD"d‚ìA—$H'‚DC¢D®¤·éC'Å:È—€ìEV%cÿŽS;þÔ’kYkùcË_ZÇZ/·þYº(ýÝ‡Ã_ó3m¤[3¤²4ÿo?²õñÖ*Z/Þiãÿ¿¾õ8Ü    ?»„O Ê£ðÅP9ÿ•¿Â¯*–z×No˜0ãÆ-êàîoR‹×ÉêÊêÂulaƒÝü

Upvotes: 1

Answers (3)

Thomas Hawkins

Reputation: 101

Okay, for anyone who might stumble across this issue themselves allow me to warn you - this is a rocky road without a great deal of good solutions. I eventually moved away from writing all of the code to extract the fonts myself. I simply downloaded MuPDF (open source) and then made command line calls to mutool.exe:

    mutool extract C:\mypdf.pdf

This pulls all of the fonts into the folder mutool resides in (it also extracts some images (these are the fonts that could not be converted (usually small subsets I think))). I then wrote a method to move those from that folder into the one I wanted them in.

Of course, to convert these to anything usable is a headache in itself - but I have found it to be doable.

As a reminder, font piracy IS piracy.

Upvotes: 0

David van Driessche

Reputation: 7046

Looks like GetStreamBytes() might solve your problem out right, but let me point out that I think you're doing something dangerous concerning end-of-line markers. The PDF Specification in 7.3.8.1 states that:

The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone.

In your code it looks like you always skip two bytes while the spec says it could be either one or two (CR LF or LF).

You should be able to catch whether you are running into this by comparing the exact number of bytes you want to decode with the value of the (Required) "Length" key in the stream dictionary.

Upvotes: 1

Bruno Lowagie

Reputation: 77606

Is there a particular reason why you're not using the GetStreamBytes() method that is provided with iText? What about data? Are you sure you are looking at the correct bytes? Did you create the PRStream object correctly and did you get the bytes with PdfReader.GetStreamBytesRaw()? If so, why decode the bytes yourself? Which brings me to my initial counter-question: is there a particular reason why you're not using the GetStreamBytes() method?

Upvotes: 1

Issues Decoding Flate from PDF Embedded Font

Answers (3)

Related Questions