Reputation: 233
I have this PDF file, which is in Greek. A known problem occurs when trying to copy and paste text from it, resulting in slight gibberish. The reason I say slight instead of total is that while the pasted output does not make sense in Greek, it is composed of valid Greek characters. Another interesting aspect of the problem is that not all characters are mapped wrongly. For example, if you compare this original strip of text
ΕΞ. ΕΠΕΙΓΟΝ – ΑΜΕΣΗ ΕΦΑΡΜΟΓΗ
ΝΑ ΣΤΑΛΕΙ ΚΑΙ ΜΕ Ε-ΜΑIL
with the pasted one from the PDF:
ΔΞ. ΔΠΔΙΓΟΝ – ΑΜΔΗ ΔΦΑΡΜΟΓΗ
ΝΑ ΣΑΛΔΙ ΚΑΙ ΜΔ Δ-ΜΑIL
you will notice that some of the characters are pasted correctly, while others are not. It is also worth mentioning that the wrong characters are swapped symmetrically, e.g. Ε becomes Δ and vice versa.
When I open the PDF in, e.g., Adobe and print it with a PDF writer (CutePDF in this case), copying and pasting from the resulting file gives the correct text!
Given the above, my questions are the following:
Upvotes: 4
Views: 1942
Reputation: 4871
Some basic context:
Displaying text in PDF is done by selecting glyphs from a font. A glyph is the visual representation of one or more characters. Glyph selection is done using character codes. For text extraction, you need to know which characters correspond to each character code.
In this case, this is achieved using a ToUnicode CMap.
In this document, the first letter of the text snippet, E, is displayed like this:
[0x01FC, ...] TJ
The ToUnicode CMap contains this entry:
4 beginbfrange
<01f9> <01fc> <0391>
...
endbfrange
This means that character codes 0x01F9, 0x01FA, 0x01FB and 0x01FC are mapped to Unicode U+0391, U+0392, U+0393 and U+0394 respectively.
U+0394 is the Greek delta, Δ, that shows up when copy/pasting.
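To make the range arithmetic concrete, here is a minimal Python sketch of how a simple bfrange entry is resolved. The helper names `parse_bfrange` and `to_unicode` are made up for illustration and are not part of any PDF library. It assumes the plain `<lo> <hi> <dst>` form with a single BMP code point as destination; the array form of bfrange and multi-unit UTF-16 destinations are not handled.

```python
from __future__ import annotations


def parse_bfrange(entry: str) -> tuple[int, int, int]:
    """Parse one '<lo> <hi> <dst>' bfrange line into three integers."""
    lo, hi, dst = (int(token.strip("<>"), 16) for token in entry.split())
    return lo, hi, dst


def to_unicode(code: int, ranges: list[tuple[int, int, int]]) -> str | None:
    """Return the character a code maps to, or None if no range covers it."""
    for lo, hi, dst in ranges:
        if lo <= code <= hi:
            # Codes inside a range map to consecutive code points from dst.
            return chr(dst + (code - lo))
    return None


# The entry quoted above from this document's ToUnicode CMap.
ranges = [parse_bfrange("<01f9> <01fc> <0391>")]

# 0x01FC is offset 3 into the range, so it resolves to U+0391 + 3 = U+0394.
print(to_unicode(0x01FC, ranges))
```

The glyph painted for 0x01FC looks like Ε, but the lookup gives Δ, which is exactly what ends up on the clipboard.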
The next letter is painted using character code 0x0204. The relevant ToUnicode entry is <0200> <020b> <039a>, which maps it correctly to U+039E, Ξ (the code is at offset 4 from the start of the range, and 0x039A + 4 = 0x039E).
So, you're getting slight gibberish, because only some of the Unicode mapping is wrong. Sometimes this is done on purpose, e.g. to prevent data mining. I have seen it before in financial reports.
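If you only have the pasted text to work with, the symmetric swap described in the question can be undone with a translation table. This is just a sketch: the only pair known from the question is Ε and Δ, so that is all the table contains, and `unswap` is a made-up helper name. Any other swapped pairs would have to be found by comparing pasted text with the original, and characters missing from the pasted text cannot be restored this way.

```python
# Only the E/D pair is confirmed by the question; extend the table if
# comparison with the original text reveals further swapped pairs.
SWAP = str.maketrans({
    "\u0395": "\u0394",  # Ε (epsilon) -> Δ (delta)
    "\u0394": "\u0395",  # Δ (delta)   -> Ε (epsilon)
})


def unswap(text: str) -> str:
    """Undo the symmetric character swap in text pasted from the PDF."""
    return text.translate(SWAP)


print(unswap("ΔΞ. ΔΠΔΙΓΟΝ"))
```

Because the swap is symmetric, the same table works in both directions, so applying `unswap` twice gives back the input.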
Upvotes: 2