How to parse PDF with Adobe CID characters

Question

Hello, community.

I have been trying to parse a PDF document using several tools, such as pdfminer for Python and pdf-parse for Node.js, but none of them can parse the special Adobe CID characters. Instead, I get the following sequence:

(cid:411)(cid:579)(cid:556)(cid:851)(cid:411)(cid:579)

Is there a tool that makes it possible to parse these characters?

mkl · Accepted Answer

In a comment you provided an example:

I attach the pdf file. For example, the line POLLEN ALLERGY is not being parsed correctly.

In your PDF file the heading "11. POLLEN ALLERGY" is drawn using this command:

<003900390048000300130012000f000f0008001100030004000f000f00080015000a001c> Tj

The active font when it is drawn is a composite font with an Identity-H encoding, a ToUnicode map without mappings, and an Adobe-Identity-0 ROS. So essentially all one knows is that it's horizontally drawn and double-byte. (Thus, in the instruction above you can split the hex string into subsequences of 4 hex digits each to get the character codes for all the glyphs.)
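As a minimal sketch of that splitting step, the snippet below cuts the hex string from the Tj instruction into 4-digit groups and converts each one to an integer character code. The function name `split_double_byte_codes` is only illustrative and does not come from any PDF library.

```python
def split_double_byte_codes(hex_string):
    """Split a PDF hex string into 2-byte (4 hex digit) character codes."""
    # Drop the angle brackets and any whitespace PDF allows inside hex strings.
    digits = "".join(hex_string.strip().strip("<>").split())
    if len(digits) % 4 != 0:
        raise ValueError("hex string length is not a multiple of 4 digits")
    # Each group of 4 hex digits is one double-byte character code.
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


tj_operand = "<003900390048000300130012000f000f0008001100030004000f000f00080015000a001c>"
codes = split_double_byte_codes(tj_operand)
print(len(codes), codes)
```

This yields 18 codes, one per character of "11. POLLEN ALLERGY", and repeated letters such as L share the same code (0x000f). The codes themselves still say nothing about which Unicode characters they stand for.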

Text extraction according to section 9.10.2, Mapping Character Codes to Unicode Values, of the PDF specification ISO 32000-1 therefore ends, for each glyph, at the final fallback:

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.

Thus, the reason the line POLLEN ALLERGY is not parsed correctly is simply that the PDF does not contain the information required for text extraction based on PDF information alone.

This also shows in Adobe Acrobat Reader: copying and pasting that line returns nothing intelligible either.

There is one option, though, to extract the text correctly: you need a text extractor that looks beyond the information in the PDF syntax and into the embedded font program. In this file, the font program does contain correct mappings from glyph to Unicode code point.
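The idea can be sketched roughly as follows. The sketch assumes that the character code equals the glyph ID in the embedded font (an identity CID-to-GID mapping, which is not confirmed for this file) and that a font library such as fontTools has already supplied a table from Unicode code point to glyph ID. The helper names and the toy table are hypothetical. The table here is reconstructed from the known heading purely for illustration; it is not read from the actual font.

```python
def invert_cmap(unicode_to_gid):
    """Turn a codepoint -> glyph ID table into glyph ID -> character."""
    gid_to_char = {}
    for codepoint, gid in sorted(unicode_to_gid.items()):
        # Several code points may share one glyph; keep the lowest one.
        gid_to_char.setdefault(gid, chr(codepoint))
    return gid_to_char


def decode_cids(cids, gid_to_char, missing="\ufffd"):
    """Map character codes to text, assuming character code == glyph ID."""
    return "".join(gid_to_char.get(cid, missing) for cid in cids)


# Character codes taken from the Tj instruction quoted in this answer.
cids = [0x39, 0x39, 0x48, 0x03, 0x13, 0x12, 0x0F, 0x0F, 0x08,
        0x11, 0x03, 0x04, 0x0F, 0x0F, 0x08, 0x15, 0x0A, 0x1C]

# Toy stand-in for the font's cmap table (illustrative only): built by
# aligning the known heading with the codes, not extracted from the font.
heading = "11. POLLEN ALLERGY"
toy_cmap = {ord(ch): cid for ch, cid in zip(heading, cids)}

decoded = decode_cids(cids, invert_cmap(toy_cmap))
print(decoded)
```

With a real file, the toy table would be replaced by the cmap read from the embedded font program. Codes without an entry come out as U+FFFD rather than being guessed.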

I don't know, though, which Python text extractors, if any, use that extra information.
