Reputation: 431
How to extract the mapping from character IDs (CIDs) to glyph instructions in an embedded CID font of a PDF?
I have a large collection of PDFs, some of which have faulty /ToUnicode CMap data, which is causing problems when extracting text from the files.
Since the rendered pages seem OK, I'd like to understand the /FontFile2 stream object (an embedded, CID type font based on OpenType) contained in the PDFs. It is probably enough just to be able to parse the stream into a mapping from CIDs to glyph instructions, without understanding how to interpret the instructions.
(The CIDs keep shifting around from one file to the next in the collection, even though there are only about half a dozen fonts. So I'm hoping that, even without understanding how to interpret the glyph instructions, I will be able to identify them uniquely and fix the /ToUnicode mapping by comparing faulty and correct mappings, perhaps even just applying a simple majority rule to determine the mapping "glyph instructions" -> Unicode, and using that to correct the mappings of individual files. If you see any problem with this approach, let me know!)
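To make the majority-rule idea concrete, here is a minimal sketch. It assumes each file has already been reduced to a dictionary from glyph signature (for example a hash of the glyph data) to the Unicode text its /ToUnicode CMap claims; the function name `majority_unicode` and that input shape are hypothetical, not part of any library.

```python
from collections import Counter, defaultdict


def majority_unicode(per_file_maps):
    """Pick, for every glyph signature, the Unicode text most files agree on.

    per_file_maps: iterable of dicts {glyph_signature: unicode_text},
    one dict per PDF (hypothetical input shape).
    """
    votes = defaultdict(Counter)
    for mapping in per_file_maps:
        for signature, text in mapping.items():
            votes[signature][text] += 1
    # most_common(1) returns the single highest-count entry per signature.
    return {sig: counter.most_common(1)[0][0] for sig, counter in votes.items()}
```

A faulty file can then be repaired by looking up each of its glyph signatures in the returned dictionary. Ties are resolved arbitrarily here, so signatures with close votes probably deserve a manual look.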
This question is similar in spirit, but my question has a different focus: I just want to be able to map a CID to some globally unique signature (e.g. the hash value of the instructions describing that glyph).
I guess the answer is hidden somewhere in the CID font specification, but I was hoping to avoid reading it...
One of the files is a PDF; here are some of the relevant objects:
31 0 obj
<<
/CIDSystemInfo 32 0 R
/CIDToGIDMap /Identity
/Subtype /CIDFontType2
/Type /Font
/W 33 0 R
/FontDescriptor 34 0 R
/DW 1000
/BaseFont /ABCDEE+David,Bold
>>
endobj
34 0 obj
<<
/Descent -265
/FontWeight 700
/StemV 52
/FontName /ABCDEE+David,Bold
/Ascent 735
/ItalicAngle 0
/AvgWidth 521
/FontBBox [-195 -265 1009 735]
/Type /FontDescriptor
/CapHeight 735
/Flags 32
/FontFile2 35 0 R
/MaxWidth 1205
/XHeight 250
>>
endobj
35 0 obj
<<
/Length1 53608
/Length 53608>>
[Omitted Stream]
If possible, I'd like to extract from the [Omitted Stream] just enough information to identify which set of instructions each CID will invoke.
Upvotes: 1
Views: 1267
Reputation: 524
Edit: just open the PDF with FontForge and select the font.
You can use PDFBox's PDF debugger. It allows you to check the glyphs directly.
Alternatively, you can use PDFBox's PDF debugger to save the FontFile2 as a TTF file, then check it in FontForge. Steps:
Upvotes: 1
Reputation: 1215
Acrobat DC Pro has a tool called Preflight, which is powerful for many different tasks and has an option "Browse Internal Structure of All Fonts". This lets you quickly and visually examine an embedded font stream. It is useful alongside writing code to parse an embedded font program: it won't tell you everything you need to know to write a parser, but it certainly helps to 'see' the glyphs or poke around a font as an academic exercise.
If you haven't already, it may be a good idea to check whether something is actually wrong with the fonts or whether the text extraction tool you are using is deficient, e.g. by trying alternative PDF software for the text extraction.
Upvotes: 1
Reputation: 95898
FontFile2 is specified as
FontFile2 stream (Optional; PDF 1.1) A stream containing a TrueType font program (see 9.9, "Embedded Font Programs").
(ISO 32000-1, Table 122 – Entries common to all font descriptors)
FontFile2 — (PDF 1.1) TrueType font program, as described in the TrueType Reference Manual. This entry may appear in the font descriptor for a TrueType font dictionary or (PDF 1.3) for a CIDFontType2 CIDFont dictionary.
(ISO 32000-1, Table 126 – Embedded font organization for various font types)
Thus, to “see” individual glyphs in a /FontFile2 stream object of a PDF, simply parse the font file from the FontFile2 stream using a font library that supports TrueType fonts in your programming and runtime environment. Such a font library should provide means to “see” individual glyphs.
Beware: In the context of PDFs not all font file features are needed. This causes numerous PDF creators to strip the font files to the actually needed information. The font library you use, therefore, should allow for some minor missing information.
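If all you need is a per-glyph signature rather than rendering, a font library is not strictly required. The following is a rough sketch using only the Python standard library, assuming you have already decoded the FontFile2 stream into bytes and that /CIDToGIDMap is /Identity (as in the question's object 31), so that a CID equals the glyph index. The function names `read_table_directory` and `glyph_signatures` are made up for this illustration. It reads the TrueType `head`, `maxp`, `loca` and `glyf` tables and hashes each glyph's raw data.

```python
import hashlib
import struct


def read_table_directory(font: bytes) -> dict:
    """Return {tag: (offset, length)} from the sfnt table directory."""
    num_tables = struct.unpack_from(">H", font, 4)[0]
    tables = {}
    for i in range(num_tables):
        # Each 16-byte record: tag, checksum, offset, length.
        tag, _checksum, offset, length = struct.unpack_from(
            ">4sIII", font, 12 + 16 * i
        )
        tables[tag.decode("latin-1")] = (offset, length)
    return tables


def glyph_signatures(font: bytes) -> dict:
    """Map glyph index (== CID under /Identity) to a SHA-1 of its glyf data.

    Empty glyphs (e.g. a space) map to None.
    """
    tables = read_table_directory(font)
    head_off = tables["head"][0]
    maxp_off = tables["maxp"][0]
    loca_off = tables["loca"][0]
    glyf_off = tables["glyf"][0]

    # head.indexToLocFormat: 0 = 16-bit offsets stored halved, 1 = 32-bit.
    loc_format = struct.unpack_from(">h", font, head_off + 50)[0]
    num_glyphs = struct.unpack_from(">H", font, maxp_off + 4)[0]

    count = num_glyphs + 1  # loca has one extra entry marking the end
    if loc_format == 0:
        raw = struct.unpack_from(f">{count}H", font, loca_off)
        offsets = [2 * value for value in raw]
    else:
        offsets = list(struct.unpack_from(f">{count}I", font, loca_off))

    signatures = {}
    for gid in range(num_glyphs):
        data = font[glyf_off + offsets[gid]:glyf_off + offsets[gid + 1]]
        signatures[gid] = hashlib.sha1(data).hexdigest() if data else None
    return signatures
```

Some caveats on this approach: composite glyphs refer to their components by glyph index, so their raw bytes change when a subset renumbers glyphs; padding and hinting instructions may also differ between subsets produced by different tools. Hashing only the decoded outline points, or simple glyphs only, may give more stable signatures. The sketch also does no error handling for stripped or truncated tables.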
Upvotes: 0