Reputation: 33
I have been trying to extract text from PDF files and most of the files seem to work fine. However, one particular document has text in this unusual font: in solid
I have tried extraction using PHP and then Python and both were unable to fix this font. I tried copying text and tried to see if I can get it fixed in text editing tools but couldn't do much.Please note that the original PDF document looks fine but when text is copied and pasted in a text editing tool, the gap between characters starts to appear. I am completely clueless on what to do. Please suggest a solution to fix this in PHP/Python (preferably PHP).
Upvotes: 1
Views: 219
Reputation: 77347
Pre-unicode, some character encodings allowed you to compose Japanese/Korean/Chinese characters either as two half width characters or one full width character. In that case, latin characters could be full width to be mixed evenly with the other characters. You have Full Width Latin characters on your hands and that's why the space out oddly.
You can normalize the string with NFKD compatibility decomposition to get to regular latin. This will also change any half/full width Japanese/Korean/Chinese characters by, um ... I'm not sure, but I think into characters built from multi code point characters.
>>> import unicodedata
>>> t="in solid"
>>> unicodedata.normalize("NFKC", t)
'in solid'
Upvotes: 2