Reputation: 31
I have a PDF file from which neither PDFBox nor iText 7 can extract text. The font uses the Identity-H encoding with an Adobe-Identity-UCS ToUnicode CMap. The ToUnicode CMap is given below.
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo > def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
The ToUnicode CMap is invalid. Is there any way to fix it?
I tried to download an intact Adobe-Identity-UCS CMap file to replace it, but after a lot of Google searching I could not find one.
Any help? Thanks.
Upvotes: 2
Views: 2430
Reputation: 96064
The ToUnicode CMap you show corresponds to the example ToUnicode CMap in the PDF specification ISO 32000 (either part), merely without any bfrange or bfchar section.
Thus, what you have essentially is a template into which one can put arbitrary mappings.
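To check whether a given ToUnicode CMap is such an empty template, one can count the entries it declares. The following is only a rough sketch: the function name `count_mappings` is made up for illustration, and it assumes the CMap stream has already been decoded to plain text.

```python
import re


def count_mappings(cmap_text: str) -> int:
    """Count the entries declared in beginbfchar / beginbfrange sections."""
    # Each section starts with "<n> beginbfchar" or "<n> beginbfrange",
    # where <n> is the number of entries that follow in that section.
    counts = re.findall(r"(\d+)\s+beginbf(?:char|range)\b", cmap_text)
    return sum(int(n) for n in counts)
```

A result of 0, as for the CMap in the question, means the CMap defines a code space but maps no code to any Unicode value.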
Concerning your question, therefore:
Is there any way to fix it?
Yes and no.
Yes, you can fix it by adding the appropriate bfrange or bfchar sections with the correct mappings.
BUT... to do so you need to know which codes map to which Unicode strings for the font at hand; the name Adobe-Identity-UCS by itself usually does not imply the mapping. So also:
No, not without additional information.
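Once the mappings are known, filling the template is mechanical. Here is a minimal sketch of how the text of such a CMap could be generated. The function name `build_tounicode_cmap` and the sample codes are hypothetical, the `/CIDSystemInfo` entry is assumed to match the example in the PDF specification, and the real code-to-Unicode pairs still have to come from the font itself or another source.

```python
def build_tounicode_cmap(mappings: dict) -> str:
    """Return ToUnicode CMap text with one bfchar entry per code.

    `mappings` maps a 2-byte character code (int) to its Unicode string.
    """
    def hex_utf16(text: str) -> str:
        # Destination values in a ToUnicode CMap are UTF-16BE, written in hex.
        return text.encode("utf-16-be").hex().upper()

    entries = [
        f"<{code:04X}> <{hex_utf16(text)}>"
        for code, text in sorted(mappings.items())
    ]
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        # Assumed to match the example CMap in the PDF specification.
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # A single bfchar section holds at most 100 entries, so emit chunks.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"
```

For instance, `build_tounicode_cmap({0x0001: "A", 0x0002: "fi"})` produces a bfchar section that maps code 0001 to U+0041 and code 0002 to the two-character string "fi". The resulting text would then replace the font's ToUnicode stream using a PDF library of your choice.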
@Tilman in his comment to your question referenced one of his answers in which he showed how to add a missing ToUnicode map using information on the actual mappings gathered from different sources.
Upvotes: 4