Reputation: 51
I am working on a project that requires converting PDF to text. The PDF contains Hindi text (in the Mangal font, specifically) along with English.
The English is converted to text with 100% accuracy, but the Hindi conversion is only around 95%. The remaining 5% of the Hindi text comes out either blank or as stray marks like " ा". I have figured out that it is the accented (combining) characters that are not being converted properly.
I am using the following command:
pdftotext -enc UTF-8 pdfname.pdf textname.txt
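For completeness, the same invocation can also be scripted; a minimal sketch, assuming the pdftotext binary from poppler-utils is on the PATH and using the placeholder file names from the command above:

import subprocess

# Same pdftotext call as above, run from Python.
# Requires poppler-utils to be installed.
subprocess.run(
    ["pdftotext", "-enc", "UTF-8", "pdfname.pdf", "textname.txt"],
    check=True,  # raise CalledProcessError if pdftotext fails
)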
The PDF uses the following fonts:
name, type, emb, sub, uni
ZDPKEY+Mangal, CID TrueType, yes, yes, yes
Mangal, TrueType, no, no, no
Helvetica-Bold, Type 1, no, no, no
CODUBM+Mangal-Bold, CID TrueType, yes, yes, yes
Mangal-Bold, TrueType, no, no, no
Times-Roman, Type 1, no, no, no
Helvetica, Type 1, no, no, no
Below is the result of the conversion. The left side is the original PDF; the right side is the text opened in Notepad:
http://preview.tinyurl.com/qbxud9o
My question is whether the 5% missing/junk characters can be correctly captured in text with open-source packages. I would appreciate your inputs!
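To put a number on that 5%, here is a minimal sketch (assuming the output file from the command above) that counts Devanagari combining marks stranded after whitespace, which is exactly the " ा" pattern shown in the screenshot:

import unicodedata

# Read the pdftotext output produced above (UTF-8 per the -enc flag)
with open("textname.txt", encoding="utf-8") as f:
    text = f.read()

orphans = 0
prev = " "
for ch in text:
    # Combining marks (categories Mn/Mc) should follow a base consonant;
    # one right after whitespace, like " \u093E" (" ा"), lost its base glyph.
    if unicodedata.category(ch) in ("Mn", "Mc") and prev.isspace():
        orphans += 1
    prev = ch

print(f"orphaned combining marks: {orphans}")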
Upvotes: 3
Views: 2292
Reputation: 960
Change your command to:
pdftotext -enc "UTF-8" pdfname.pdf textname.txt
It worked for me, so it should work for you as well.
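To verify the result, a quick sanity check along these lines may help (a sketch; the file name is the placeholder from the command above):

from collections import Counter

# Confirm the output file really is valid UTF-8 and inspect which
# Devanagari code points made it through the conversion.
with open("textname.txt", "rb") as f:
    raw = f.read()

text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8

# Count characters in the Devanagari block (U+0900 to U+097F)
devanagari = Counter(ch for ch in text if "\u0900" <= ch <= "\u097F")
print(devanagari.most_common(10))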
Upvotes: 4