TacB0sS

Reputation: 10266

Reading PDF Literal String parsing dilemma

I have the following contents on the same PDF page, in different content stream objects:

First:

[(some text)] TJ ET Q
[(some other text)] TJ ET Q

Very simple and basic so far...

The second:

[( H T M L   E x a m p l e)] TJ ET Q
[( S o m e   s p e c i a l   c h a r a c t e r s :   <   ¬   ¬   ¬   &   ט   ט   ©   >   \\ s l a s h   \\ \\ d o u b l e - s l a s h   \\ \\ \\ t r i p l e - s l a s h  )] TJ ET Q

NOTE: It is not noticeable in the text above, but:

'H T M L E x a m p l e' is actually 0H0T0M0L0[32]0E0x0a0m0p0l0e, where each 0 is a literal zero byte (0 == (char) 0). So if I ignore all the 0 values, this turns out to be just like the first example...

Some Bytes:

htmlexample == [0, 72, 0, 84, 0, 77, 0, 76, 0, 32, 0, 69, 0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101]
<content>  == [0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 32, -84, 0, 32, 0, 38, 0, 32, 0, -24, 0, 32, 0, -24, 0, 32, 0, -87, 0, 32, 0]

But in the next line I need to combine every two bytes into one char, because of the following:

< ¬ ¬ ¬...> is actually <0[32][32]¬0[32][32]¬0[32][32]¬...>, where the combination [32]¬ (0x20 0xAC) is €

The problem I'm facing is not the conversion itself; for that I use: new String(sb.toString().getBytes("UTF-8"), "UTF-16BE")

The problem is knowing when to apply it and when to keep the bytes as UTF-8.
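For what it's worth, combining the bytes pairwise directly avoids the lossy UTF-8 round trip entirely. A minimal sketch of the two-byte combination described above (the class and method names are made up):

```java
public class Utf16Decode {
    // Combine the raw bytes of a literal string into chars, two bytes per
    // code, big-endian: the layout an Identity-H composite font produces.
    public static String decodeTwoByte(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            sb.append((char) (((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] html = {0, 72, 0, 84, 0, 77, 0, 76, 0, 32,
                       0, 69, 0, 120, 0, 97, 0, 109, 0, 112, 0, 108, 0, 101};
        System.out.println(decodeTwoByte(html)); // prints "HTML Example"
        // [32]¬ is the bytes 0x20 0xAC, which combine to the euro sign
        System.out.println(decodeTwoByte(new byte[]{32, (byte) -84})); // prints "€"
    }
}
```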

== UPDATE ==

The font used for the problematic Object is:

#7 0# {
    'Name' : "F4"
    'BaseFont' : "AAAAAE+DejaVuSans-Bold"
    'Subtype' : "Type0"
    'ToUnicode' : #41 0# {
        'Filter' : "FlateDecode"
        'Length' : 1679.0f
    } + Stream(5771 bytes)
    'Encoding' : "Identity-H"
    'DescendantFonts' : [#42 0# {
        'FontDescriptor' : #43 0# {
            'MaxWidth' : 2016.0f
            'AvgWidth' : 573.0f
            'FontBBox' : [-1069.0f, -415.0f, 1975.0f, 1174.0f]
            'MissingWidth' : 600.0f
            'FontName' : "AAAAAE+DejaVuSans-Bold"
            'Type' : "FontDescriptor"
            'CapHeight' : 729.0f
            'StemV' : 60.0f
            'Leading' : 0.0f
            'FontFile2' : #34 0# {
                'Filter' : "FlateDecode"
                'Length1' : 83036.0f
                'Length' : 34117.0f
            } + Stream(83036 bytes)
            'Ascent' : 928.0f
            'Descent' : -236.0f
            'XHeight' : 547.0f
            'StemH' : 26.0f
            'Flags' : 32.0f
            'ItalicAngle' : 0.0f
        }
        'Subtype' : "CIDFontType2"
        'W' : [32.0f, [348.0f, 456.0f, 521.0f, 838.0f, 696.0f, 1002.0f, 872.0f, 306.0f, 457.0f, 457.0f, 523.0f, 838.0f, 380.0f, 415.0f, 380.0f, 365.0f], 48.0f, 57.0f, 696.0f, 58.0f, 59.0f, 400.0f, 60.0f, 62.0f, 838.0f, 63.0f, [580.0f, 1000.0f, 774.0f, 762.0f, 734.0f, 830.0f, 683.0f, 683.0f, 821.0f, 837.0f, 372.0f, 372.0f, 775.0f, 637.0f, 995.0f, 837.0f, 850.0f, 733.0f, 850.0f, 770.0f, 720.0f, 682.0f, 812.0f, 774.0f, 1103.0f, 771.0f, 724.0f, 725.0f, 457.0f, 365.0f, 457.0f, 838.0f, 500.0f, 500.0f, 675.0f, 716.0f, 593.0f, 716.0f, 678.0f, 435.0f, 716.0f, 712.0f, 343.0f, 343.0f, 665.0f, 343.0f, 1042.0f, 712.0f, 687.0f, 716.0f, 716.0f, 493.0f, 595.0f, 478.0f, 712.0f, 652.0f, 924.0f, 645.0f, 652.0f, 582.0f, 712.0f, 365.0f, 712.0f, 838.0f], 160.0f, [348.0f, 456.0f, 696.0f, 696.0f, 636.0f, 696.0f, 365.0f, 500.0f, 500.0f, 1000.0f, 564.0f, 646.0f, 838.0f, 415.0f, 1000.0f, 500.0f, 500.0f, 838.0f, 438.0f, 438.0f, 500.0f, 736.0f, 636.0f, 380.0f, 500.0f, 438.0f, 564.0f, 646.0f], 188.0f, 190.0f, 1035.0f, 191.0f, 191.0f, 580.0f, 192.0f, 197.0f, 774.0f, 198.0f, [1085.0f, 734.0f], 200.0f, 203.0f, 683.0f, 204.0f, 207.0f, 372.0f, 208.0f, [838.0f, 837.0f], 210.0f, 214.0f, 850.0f, 215.0f, [838.0f, 850.0f], 217.0f, 220.0f, 812.0f, 221.0f, [724.0f, 738.0f, 719.0f], 224.0f, 229.0f, 675.0f, 230.0f, [1048.0f, 593.0f], 232.0f, 235.0f, 678.0f, 236.0f, 239.0f, 343.0f, 240.0f, [687.0f, 712.0f, 687.0f, 687.0f, 687.0f, 687.0f, 687.0f], 247.0f, [838.0f, 687.0f], 249.0f, 252.0f, 712.0f, 253.0f, [652.0f, 716.0f]]
        'Type' : "Font"
        'BaseFont' : "AAAAAE+DejaVuSans-Bold"
        'CIDSystemInfo' : {
            'Supplement' : 0.0f
            'Ordering' : "Identity" + Stream(8 bytes)
            'Registry' : "Adobe" + Stream(5 bytes)
        }
        'DW' : 600.0f
        'CIDToGIDMap' : #44 0# {
            'Filter' : "FlateDecode"
            'Length' : 10200.0f
        } + Stream(131072 bytes)
    }]
    'Type' : "Font"
}

There is no indication of the encoding type of the font.
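That said, the Encoding value Identity-H in the dump above does at least determine the code width: Type0 fonts with Identity-H use two-byte CIDs. A minimal sketch of such a check, assuming the font dictionary has been parsed into a plain name-to-value Map (real PDFs also allow /Encoding to be an embedded CMap stream, which this sketch does not cover):

```java
import java.util.Map;

public class CodeWidth {
    // Decide how many bytes make up one character code for this font.
    public static int bytesPerCode(Map<String, Object> font) {
        if ("Type0".equals(font.get("Subtype"))
                && ("Identity-H".equals(font.get("Encoding"))
                    || "Identity-V".equals(font.get("Encoding")))) {
            return 2; // two-byte CIDs, mapped 1:1 to glyph indices
        }
        return 1; // simple fonts use one-byte codes
    }
}
```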

== Update ==

As for the ToUnicode object: in the case of this font it is unnecessary. It should have been Identity-H, but instead it is an X == X mapping. Here are some examples of the ranges, which go from 0000 up to FFFF:

<0000> <00ff> <0000>
<0100> <01ff> <0100>
<0200> <02ff> <0200>
<0300> <03ff> <0300>
<0400> <04ff> <0400>
<0500> <05ff> <0500>
<0600> <06ff> <0600>
<0700> <07ff> <0700>
<0800> <08ff> <0800>
<0900> <09ff> <0900>
<0a00> <0aff> <0a00>
<0b00> <0bff> <0b00>
<0c00> <0cff> <0c00>
<0d00> <0dff> <0d00>
<0e00> <0eff> <0e00>
<0f00> <0fff> <0f00>
<1000> <10ff> <1000>
<1100> <11ff> <1100>
....
....
....
<fc00> <fcff> <fc00>
<fd00> <fdff> <fd00>
<fe00> <feff> <fe00>
<ff00> <ffff> <ff00>

So the mapping is not really in the ToUnicode object, but other renderers can still render it well!
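Once the bfrange triples are parsed, detecting such a redundant mapping is trivial; a sketch (parsing the decompressed CMap stream into the triples is assumed to happen elsewhere):

```java
public class ToUnicodeCheck {
    // Each bfrange entry is {srcLo, srcHi, dstLo}: codes srcLo..srcHi map to
    // dstLo, dstLo+1, ... The CMap is an identity mapping when srcLo == dstLo
    // in every range, as in the dump above.
    public static boolean isIdentity(int[][] bfRanges) {
        for (int[] r : bfRanges) {
            if (r[0] != r[2]) return false;
        }
        return true;
    }
}
```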

Any Ideas?

Upvotes: 1

Views: 915

Answers (2)

TacB0sS

Reputation: 10266

OK, so this turned out to be complicated, and the reason for the bug was a silly one, especially on my end. But there is a lesson to be learned about when to treat the chars as UTF-16 and when not to.

My problem was not in parsing the fonts, but in rendering them. According to the details specified in the Font object, you can determine the type of the font and apply the correct logic to it.
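The lesson above boils down to a single dispatch point: once the font type tells you the code width, the same loop handles both cases. A hedged sketch (names invented):

```java
public class GlyphCodes {
    // Split the raw string bytes into character codes, using the code width
    // determined from the font: 1 for simple fonts, 2 for Identity-H.
    public static int[] codes(byte[] raw, int bytesPerCode) {
        int[] out = new int[raw.length / bytesPerCode];
        for (int i = 0; i < out.length; i++) {
            int code = 0;
            for (int b = 0; b < bytesPerCode; b++) {
                code = (code << 8) | (raw[i * bytesPerCode + b] & 0xFF);
            }
            out[i] = code;
        }
        return out;
    }
}
```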

Upvotes: 0

mkl

Reputation: 96039

I use: new String(sb.toString().getBytes("UTF-8"),"UTF-16BE")

The problem is to know when to apply it and when to keep the UTF-8.

The OP assumes, probably after examining some sample PDF files, that strings in PDF content streams are encoded using either UTF-8 or UTF-16BE.

This assumption is wrong.

PDF allows some standard single-byte encodings (MacRomanEncoding, MacExpertEncoding, and WinAnsiEncoding), none of which is UTF-8 (due to the relations between different encodings, especially ASCII, Latin-1, and UTF-8, they may be confused with each other when only a limited sample is examined). Furthermore, numerous predefined multi-byte encodings are also allowed, some of which are indeed UTF-16-related.

But PDF allows completely custom encodings, both single-byte and multi-byte, to be used, too!

E.g. this text drawing operation

(ABCCD) Tj

for a simple font with this encoding:

<<
 /Type /Encoding
 /Differences [ 65 /H /e /l /o ] 
>>

displays the word Hello!

And while this may look like an artificially constructed example, the procedure to create a custom encoding like this (i.e. by assigning codes from some start value upwards to glyphs in the order in which they first occur on the page or in the document) is fairly often used.
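The /Differences mechanism described above is easy to emulate. A minimal sketch of how (ABCCD) becomes Hello under that encoding; note the toy decode step assumes each glyph name is the character itself, whereas real decoding goes through a glyph list such as the Adobe Glyph List:

```java
import java.util.HashMap;
import java.util.Map;

public class Differences {
    // Build a code -> glyph-name table from a /Differences array: an integer
    // sets the next code; each following name consumes one code.
    public static Map<Integer, String> table(Object[] differences) {
        Map<Integer, String> t = new HashMap<>();
        int code = 0;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else {
                t.put(code++, (String) item);
            }
        }
        return t;
    }

    // Toy decode: look each code up in the table, treating the glyph name
    // as the resulting text (valid for this constructed example only).
    public static String decode(String codes, Map<Integer, String> t) {
        StringBuilder sb = new StringBuilder();
        for (char c : codes.toCharArray()) {
            sb.append(t.getOrDefault((int) c, "?"));
        }
        return sb.toString();
    }
}
```

Here `table(new Object[]{65, "H", "e", "l", "o"})` assigns code 65 to /H, 66 to /e, 67 to /l, and 68 to /o, so the codes A B C C D decode to Hello.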

Furthermore, the OP's current solution

If your font object has a CMap, then you treat it as a UTF-16, otherwise not.

will only work for a very few documents because

a) simple fonts (using single-byte encodings) may also supply a ToUnicode CMap, and b) composite fonts' CMaps also need not be UTF-like but can instead use a mixed multi-byte encoding.

Thus, there is no way around an in-depth analysis of the font information used; cf. sections 9.5 to 9.9 of the PDF specification ISO 32000-1.

PS On some comments by the OP:

this: new String(sb.toString().getBytes("UTF-8"),"UTF-16BE") was an example to the how the problem is solved not a solution! The solution is done while fetching the glyphs whether I treat the data as 16-bit or 8-bit

and

the ToUnicode map is 16-bit(The only ones I've seen) per key,

The data may be mixed; e.g. have a look at the Adobe CMap and CIDFont Files Specification, where CMap Example 9 contains the section

4 begincodespacerange
<00> <80>
<8140> <9ffc>
<a0> <de>
<e040> <fbec>
endcodespacerange

which is explained to mean

Figure 6 shows how the codespace definition in this example comprises two single-byte linear ranges of codes (<00> to <80> and <A0> to <DF>) and two double-byte rectangular ranges of codes (<8140> to <9FFC> and <E040> to <FBFC>). The first two-byte region comprises all codes bounded by first-byte values of 81 through 9F and second-byte values of 40 through FC. Thus, the input code <86A9> is within the region because both bytes are within bounds. That code is valid. The input code <8210> is not within the region, even though its first byte is between 81 and 9F, because its second byte is not within bounds. That code is invalid. The second two-byte region is similarly bounded.

Figure 6 Codespace ranges for the 83pv-RKSJ-H charset encoding
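Decoding against such mixed codespace ranges can be sketched like this (class and method names invented; the byte-wise bounds check mirrors the explanation quoted above, and the one-byte fallback for unmatched input is an assumption, not spec-mandated behavior):

```java
import java.util.ArrayList;
import java.util.List;

public class Codespace {
    // One codespace range: code length in bytes plus inclusive low/high
    // bounds, compared byte by byte as the quoted specification describes.
    public static final class Range {
        final int nBytes, lo, hi;
        public Range(int nBytes, int lo, int hi) {
            this.nBytes = nBytes; this.lo = lo; this.hi = hi;
        }
        boolean contains(int code) {
            for (int b = nBytes - 1; b >= 0; b--) {
                int shift = 8 * b;
                int v = (code >> shift) & 0xFF;
                if (v < ((lo >> shift) & 0xFF) || v > ((hi >> shift) & 0xFF)) return false;
            }
            return true;
        }
    }

    // Split raw string bytes into codes by trying each range at the current
    // position; an unmatched byte is consumed on its own as a fallback.
    public static List<Integer> split(byte[] raw, Range[] ranges) {
        List<Integer> codes = new ArrayList<>();
        int i = 0;
        while (i < raw.length) {
            int len = 0, code = 0;
            for (Range r : ranges) {
                if (i + r.nBytes > raw.length) continue;
                int c = 0;
                for (int b = 0; b < r.nBytes; b++) c = (c << 8) | (raw[i + b] & 0xFF);
                if (r.contains(c)) { len = r.nBytes; code = c; break; }
            }
            if (len == 0) { len = 1; code = raw[i] & 0xFF; }
            codes.add(code);
            i += len;
        }
        return codes;
    }
}
```

With the four ranges from the example, the byte 86 starts a two-byte code (86A9 lies inside <8140>-<9FFC>), while 40 and A0 each stand alone as single-byte codes.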

Upvotes: 4
