Reputation: 60
i got an PDF 1.3 where I want to extract the text. But in the stream there are 2 different types of text. Some plain text and some character coded text with escape sequences.
Here an example:
/TextClip BMC
BT
/T1_2 1 Tf
0 Tc 0 Tw 7 Tr 16.2626 0 0 16.2626 37.2512 581.738 Tm
(Test Test)Tj
ET
EMC
q
/GS0 gs
67.6799985 0 0 -13.4399997 37.439994 594.2399583 cm
/Im47 Do
Q
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/GS0 gs
180.959996 0 0 -9.5999998 36.959999 578.3999755 cm
/Im48 Do
Q
Q
q
37.499 569.52 179.713 8.34 re
W n
q
/TextClip BMC
BT
0 Tc 0 Tw 7 Tr 9.899 0 0 9.899 37.2512 569.7178 Tm
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
ET
EMC
In this example ther are 2 times the text "Test Test". One time as plan text and the other time with the escape sequence \000E\000V\000d\000e\000\003\000E\000V\000d\000e
.
I only knew, if there are after an escape sequence 3 digits, that this is an octal character code. But in my example there are some time 4 and some times 3 digits.
The 4 character after the escape sequence is at 15 next to the correct ascii code. (\000E
is character "T") But what is the correct conversion?
The text block \000\003
should be a space sign. What is there the conversion hack?
Regards
Upvotes: 3
Views: 1226
Reputation: 95963
The encoding of the string arguments of text showing instructions like TJ and Tj depends on the PDF font in question, cf. the specification
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(section 9.4.3 - Text-Showing Operators - in ISO 32000-1)
The font used for the first text showing operation
(Test Test)Tj
probably is a simple font with an ASCII'ish encoding, probably WinAnsiEncoding. The font itself is selected two lines above in
/T1_2 1 Tf
so you only have to look up the font resource T1_2 the associated resources (the resources of the page if you are showing us an excerpt of a page content stream) to verify.
The font used in the second text showing operation
[(\000E\000V\000d\000e\000\003\000E\000V\000d\000e)]TJ
appears to be a composite font with a double-byte encoding, probably Identity-H, and the underlying font program appears to have the glyph codes most often found in TrueType fonts. You should look for a ToUnicode mapping in that PDF font for easy decoding.
The instruction in which this font is selected, is not among the instructions you posted but instead must be somewhere above. This selection has been saved as part of the graphics state (in some early q instructions) and restored again (in some Q instruction between the two text showing instructions you shared).
if there are after an escape sequence 3 digits, that this is an octal character code. But in my example there are some time 4 and some times 3 digits.
No, in your example there always are escape sequences with three octal digits. The character thereafter is a separate byte, i.e. you have the bytes '\000', 'E', '\000', 'V', '\000', 'd', '\000', 'e', '\000', '\003', '\000', 'E', '\000', 'V', '\000', 'd', '\000', and 'e'.
As mentioned above, this looks like a double-byte encoding with in particular the mappings
\000E -> 'T'
\000V -> 'e'
\000d -> 's'
\000e -> 't'
\000\003 -> ' ' (space)
This appears to be a glyph encoding often found in TrueType fonts which for Latin letters merely means a constant offset to their Unicode codes.
But there also are many different multi-byte encodings in common use, sometimes they even are ad-hoc encodings only created for the font on the page at hands.
Thus, if you seriously want to do text extraction from PDFs, you really have to study the PDF specification and implement along its requirements instead of hoping for some conversion hack.
Adobe has published a copy of the old PDF specification ISO 32000-1 on their web page at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
Upvotes: 3