millenseed

Reputation: 233

Copy-pasting from a PDF gives gibberish on the original file, but works after re-printing the PDF with CutePDF

I have a PDF file in Greek. Copying and pasting text from it produces slight gibberish. I say slight rather than total because, while the pasted output does not make sense in Greek, it consists of valid Greek characters. Also, not all characters are mapped incorrectly. For example, if you compare this original strip of text

ΕΞ. ΕΠΕΙΓΟΝ – ΑΜΕΣΗ ΕΦΑΡΜΟΓΗ
ΝΑ ΣΤΑΛΕΙ ΚΑΙ ΜΕ Ε-ΜΑIL

with the pasted one from the PDF:

ΔΞ. ΔΠΔΙΓΟΝ – ΑΜΔ΢Η ΔΦΑΡΜΟΓΗ
ΝΑ ΢ΣΑΛΔΙ ΚΑΙ ΜΔ Δ-ΜΑIL

you will notice that some of the characters are pasted correctly, while others are not. It may also be worth mentioning that the wrong mappings appear to be symmetric, e.g. Ε becomes Δ and vice versa.

When I open the PDF in e.g. Adobe and print it with a PDF writer (CutePDF in this case), copying and pasting from the resulting file gives the correct text!

Given the above, my questions are the following:

  1. What is the root cause of this behavior?
  2. How would I go about integrating a solution into a Java-based workflow that handles arbitrary imported PDF files?

EDIT: a few typos

Upvotes: 4

Views: 1942

Answers (1)

rhens

Reputation: 4871

Some basic context:

Text in a PDF is displayed by selecting glyphs from a font. A glyph is the visual representation of one or more characters. Glyphs are selected using character codes. For text extraction, you need to know which characters correspond to each character code.

In this case, that correspondence is provided by a ToUnicode CMap.

In this document, the first letter of the text snippet, Ε, is painted like this:

[0x01FC, ...] TJ

The ToUnicode CMap contains this entry:

4 beginbfrange
<01f9> <01fc> <0391>
...
endbfrange

This means that character codes 0x01F9, 0x01FA, 0x01FB and 0x01FC are mapped to Unicode U+0391, U+0392, U+0393 and U+0394 respectively.

U+0394 is the Greek delta, Δ, that shows up when copy/pasting.
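
To make the range arithmetic concrete, here is a minimal Python sketch that expands the bfrange entry quoted above. The function name `expand_bfrange` is made up for illustration, and the sketch only covers the simple form where the third operand is a single code point in the Basic Multilingual Plane; it is not a full CMap parser.

```python
def expand_bfrange(first_code, last_code, first_unicode):
    """Expand one simple bfrange entry into a {character code: character} dict.

    Assumption: the destination is a single BMP code point that is
    incremented by one for each successive character code.
    """
    return {
        code: chr(first_unicode + (code - first_code))
        for code in range(first_code, last_code + 1)
    }


# The entry <01f9> <01fc> <0391> from the ToUnicode CMap above.
mapping = expand_bfrange(0x01F9, 0x01FC, 0x0391)
for code, char in mapping.items():
    print(f"0x{code:04X} -> U+{ord(char):04X} {char}")
```

The last line printed is the one for 0x01FC, the code used to paint the glyph that looks like Ε but extracts as Δ.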

The next letter is painted using character code 0x0204. The relevant ToUnicode entry is <0200> <020b> <039a>. Code 0x0204 is four positions past the start of that range, so it maps to U+039A + 4 = U+039E, the Greek xi, Ξ, which is correct.
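
The same lookup can be done straight from the textual entries, without expanding each range first. The sketch below is a rough illustration only: `lookup` and `BFRANGE_LINE` are hypothetical names, and the regular expression handles just the three-operand form shown in this answer.

```python
import re

# Matches a simple bfrange line such as "<0200> <020b> <039a>".
BFRANGE_LINE = re.compile(
    r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>"
)


def lookup(code, bfrange_lines):
    """Return the character a code maps to, or None if no range covers it."""
    for line in bfrange_lines:
        match = BFRANGE_LINE.fullmatch(line.strip())
        if match is None:
            continue  # not a simple three-operand entry, skip it
        lo, hi, start = (int(group, 16) for group in match.groups())
        if lo <= code <= hi:
            # Offset within the range is added to the first Unicode value.
            return chr(start + (code - lo))
    return None


entries = ["<01f9> <01fc> <0391>", "<0200> <020b> <039a>"]
xi = lookup(0x0204, entries)
print(f"0x0204 -> U+{ord(xi):04X} {xi}")
```

Running `lookup(0x01FC, entries)` on the same list returns Δ, reproducing the wrong mapping for the first letter.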

So, you're getting slight gibberish because only some of the Unicode mappings are wrong. Sometimes this is done on purpose, e.g. to prevent data mining. I have seen it before in financial reports.

Upvotes: 2
