Reputation: 856
I am trying to copy some text from a PDF. But When I paste it in a word file, it is just some garbage. Something like മുഖവുര. The PDF is in Malayalam language. When I see File->Properties->Fonts, It says BRHMalayalam (Embedded Subset) as shown in the screenshot.
I installed various Malayalam fonts but still no luck. Can anyone please guide me?
The PDF I am trying to copy from is https://drive.google.com/open?id=0B3QCwY9Vanoza0tBdFJjd295WEE&authuser=0
Upvotes: 2
Views: 4540
Reputation: 1
Just convert the pdf file to word file and then edit or copy or modify the text present in the file simple :) and after completion go to file -> save as -> and change the format of doc to pdf ..hope u understood :)
Upvotes: 0
Reputation: 7046
After looking at the file, and confirming the answer already given by @KenS, the problem with this PDF document is in fact how it's constructed. Or rather how the font in the document has been embedded.
The document contains a number of Times and Arial fonts, for which the text can be copied successfully. Those fonts are embedded as a subset with a WinAnsi
encoding. What is actually in the file is close enough to that, that the text seems to copy out well.
The problem font (BRHMalayalam) is also embedded as a subset, and its encoding is also set as WinAnsiEncoding
, which completely doesn't make sense.
And because the font doesn't contain a ToUnicode
mapping table, a PDF viewer has no other choice when copying and pasting to assume the characters in the PDF are indeed Win Ansi encoding which means you end up with (garbled) latin characters.
Upvotes: 1
Reputation: 31199
Installing fonts won't help, since they are embedded in the document. The reader will use the ones in the document.
In fact it almost certainly must use the ones on the document, because it will probably have used character codes specific to each font subset.
Your PDF probably has character codes which are not Unicode values, and does not contain ToUnicode CMaps for the fonts in question (note the same font name embedded multiple times). There is no realistic way to copy the text.
The best you can do is OCR it.
Upvotes: 2