ThatMSG

Reputation: 1506

Messed up special chars when reading PDF

I'm using a little PHP class (pdf2text) to open and read a "text" PDF.

Currently I can't get it to handle special characters like è, ä, ö, ü and so on correctly. I tried setting the header to UTF-8 and encoding the received data as UTF-8, but the characters still won't display correctly.

The class can be found here: http://pastebin.com/PSmu03nH

If someone has any further ideas or even a solution, please let me know.

Upvotes: 1

Views: 855

Answers (1)

mkl

Reputation: 95918

In a nutshell:

The PDF2Text class you use ignores large parts of the PDF specification ISO 32000-1:2008. It works only under very specific circumstances.

To slightly improve the decoding of the special characters (umlauts, accented characters, ...) mentioned in your question, you might want to add a translation step according to Annex D (Character Sets and Encodings) of the PDF specification.

In detail:

decodePDF walks over the objects in the PDF and selects the stream objects. It completely ignores whether those objects are still in use (i.e. in a document that has been revised multiple times, streams from all revisions are picked up).

From these streams it removes all whose dictionaries contain a Length1, Type, or SubType key. The (good) intention is to remove streams which contain other stuff than page content. An unfortunate side effect is that object streams are removed, too: object streams have been part of the PDF specification since PDF 1.5; they bundle multiple other objects and offer better compression than regular, top-level objects. The contents of documents making use of this feature are therefore lost here.
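To illustrate, here is a rough PHP sketch of the behaviour described above; it is not the actual pdf2text source, and the input file name is made up:

    <?php
    // Rough sketch, NOT the pdf2text code: scan the raw bytes for
    // dictionary + stream pairs. There is no cross-reference handling,
    // so superseded objects from earlier revisions are picked up too.
    $raw = file_get_contents('sample.pdf'); // hypothetical input file

    preg_match_all('/<<(.*?)>>\s*stream\r?\n(.*?)endstream/s', $raw, $matches, PREG_SET_ORDER);

    $contentStreams = [];
    foreach ($matches as $match) {
        // Dropping streams with these keys also discards object streams
        // (/Type /ObjStm, PDF 1.5+), losing everything bundled in them.
        if (preg_match('/\/(Length1|Type|SubType)\b/i', $match[1])) {
            continue;
        }
        $contentStreams[] = $match[2];
    }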

It then checks whether the remaining streams contain text objects. If a stream contains text objects (BT ... ET), the inner contents of these objects are processed by getDirtyTexts; otherwise the stream is processed by getCharTransformations.
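Continuing the sketch above, that routing step could look like this (gzuncompress handles the common FlateDecode filter; everything else is left raw):

    // Decide per stream whether it is treated as page content or as a
    // ToUnicode CMap candidate.
    $textStreams = $cmapStreams = [];
    foreach ($contentStreams as $streamData) {
        // Most streams are FlateDecode'd (zlib); fall back to the raw
        // bytes if inflating fails (gzuncompress() returns false then).
        $body = @gzuncompress($streamData) ?: $streamData;

        if (preg_match('/\bBT\b.*?\bET\b/s', $body)) {
            $textStreams[] = $body;  // handed to getDirtyTexts
        } else {
            $cmapStreams[] = $body;  // handed to getCharTransformations
        }
    }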

getDirtyTexts collects the string arguments of the text showing operators TJ and Tj. This means it ignores the arguments of the text operators ' and " as well as any information on how the strings are positioned relative to each other. The content of documents that make extensive use of kerning information, or that use such positioning instead of space characters to separate words, may therefore become completely unreadable. The operations selecting fonts are also thrown away here; but as the streams are never connected to their respective resources objects, font information could not be matched anyway...
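A self-contained example of what such a collector sees and misses; the content stream body is made up for illustration:

    // Hypothetical decoded content stream body:
    $body = "BT /F1 12 Tf (first line) Tj (second line) ' [(Ker) -30 (ning)] TJ ET";

    // In the spirit of getDirtyTexts: only (...) Tj and [...] TJ match.
    preg_match_all('/\[(.*?)\]\s*TJ|\((.*?)\)\s*Tj/s', $body, $strings, PREG_SET_ORDER);

    // $strings now holds "(first line) Tj" and the TJ array, but NOT
    // "(second line) '": the ' operator (next line + show text) and the
    // " operator (which additionally sets word/char spacing) are never
    // matched. The -30 inside the TJ array is captured as raw text, but
    // its meaning (a glyph position adjustment, often used instead of a
    // space character) is lost on this extractor.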

getCharTransformations assumes the stream to be a ToUnicode CMap and adds the mappings from all these streams into one single map. As multiple ToUnicode streams, if present, most likely belong to different fonts and may have completely different mappings, putting them all into one map loses a lot of mapping information unless the fonts in question happen to have non-overlapping character code ranges... and why should they!
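A self-contained sketch of the single-map pitfall; the CMap excerpts are made up, and real ToUnicode CMaps also contain bfrange sections, which this sketch ignores:

    // Two hypothetical ToUnicode CMap excerpts from two different fonts:
    $cmapStreams = [
        '2 beginbfchar <01> <00E4> <02> <0076> endbfchar', // font 1: "ä", "v"
        '2 beginbfchar <01> <0041> <02> <0042> endbfchar', // font 2: "A", "B"
    ];

    $map = [];
    foreach ($cmapStreams as $cmap) {
        preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $blocks);
        foreach ($blocks[1] as $block) {
            preg_match_all('/<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/', $block, $pairs, PREG_SET_ORDER);
            foreach ($pairs as $pair) {
                // Font 2's <01> => <0041> silently overwrites font 1's
                // <01> => <00E4>: "ä" turns into "A" for font 1's text.
                $map[strtoupper($pair[1])] = strtoupper($pair[2]);
            }
        }
    }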

Now decodePDF calls getTextUsingTransformations to combine the results of those two methods. It walks through the strings extracted by getDirtyTexts: hex-encoded strings are decoded and then translated using the mappings extracted by getCharTransformations; literal (non-hex-encoded) strings are copied as is, without any translation.

Thus, the contents of hex-encoded strings are interpreted according to some ToUnicode mapping which might or might not belong to the font they are actually used with, and the contents of literal strings are used as is, completely ignoring the encoding of their respective font.
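With the $map from the sketch above, the two branches behave like this:

    // Hex branch: one-byte codes as font 1 would write them for "äv" ...
    $hexString = '0102'; // originally <0102> in the content stream
    $out = '';
    foreach (str_split($hexString, 2) as $code) {
        if (isset($map[$code])) {
            // The map value is a Unicode code point in hex, e.g. 0041:
            $out .= mb_chr(hexdec($map[$code]), 'UTF-8');
        }
    }
    echo $out; // "AB" -- font 2's mapping won although font 1 meant "äv"

    // Literal branch: the bytes bypass $map entirely and keep the font's
    // encoding (here WinAnsiEncoding), so they are not valid UTF-8:
    $literal = "Gr\xFC\xDFe"; // should read "Grüße"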

In essence, therefore, the only PDFs this class can be used with somewhat successfully have to use a standard encoding for all fonts used with literal strings (the standard encodings are ASCII-like up to character code 127), and encodings with identical mappings wherever their character code ranges overlap for all fonts used with hex-encoded strings.

To slightly improve the decoding of special characters from literal strings, you might want to add a translation step according to Annex D (Character Sets and Encodings) of the PDF specification.
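A minimal sketch of such a translation step, assuming the fonts use WinAnsiEncoding (which largely coincides with Windows-1252); only a handful of the Annex D entries are shown:

    // Hypothetical literal string as copied verbatim by the class, still
    // in the font's WinAnsiEncoding:
    $literal = "Gr\xFC\xDFe"; // should read "Grüße"

    // A few entries from the Annex D WinAnsiEncoding table:
    $winAnsiToUtf8 = [
        "\xE8" => 'è', "\xE4" => 'ä', "\xF6" => 'ö', "\xFC" => 'ü',
        "\xC8" => 'È', "\xC4" => 'Ä', "\xD6" => 'Ö', "\xDC" => 'Ü',
        "\xDF" => 'ß',
    ];
    echo strtr($literal, $winAnsiToUtf8); // "Grüße"

    // As WinAnsiEncoding is essentially Windows-1252, iconv() can
    // replace the hand-written table in one call:
    echo iconv('CP1252', 'UTF-8', $literal);

This only helps for fonts that actually use one of the standard encodings; fonts with custom Differences arrays or embedded subset encodings will still come out wrong.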

Upvotes: 3
