Dian

Reputation: 51

pdftotext not converting UTF-8 encoded text completely, especially accented characters

I am working on a project which requires converting PDF to text. The PDF contains Hindi fonts (Mangal, to be specific) along with English.

The English is converted to text 100% correctly, but only about 95% of the Hindi converts. The remaining 5% of the Hindi text either comes out blank or as stray marks like " ा". I could figure out that the accented characters are not being converted to text properly.

I am using the following command:

pdftotext -enc UTF-8 pdfname.pdf textname.txt
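The stray " ा" in the output is a Devanagari dependent vowel sign (U+093E) whose base consonant was lost, so the damaged spots can be located by scanning for combining marks that follow whitespace. A minimal Python sketch for that check (assuming the output file is textname.txt from the command above):

import unicodedata

# Scan the pdftotext output for Devanagari combining marks whose base
# consonant went missing, i.e. marks that follow whitespace or start a
# line -- which is what the stray " ा" in the output looks like.
with open("textname.txt", encoding="utf-8") as f:
    text = f.read()

for lineno, line in enumerate(text.splitlines(), start=1):
    prev = " "
    for col, ch in enumerate(line):
        # Mc/Mn cover spacing and non-spacing combining marks,
        # e.g. U+093E DEVANAGARI VOWEL SIGN AA.
        if unicodedata.category(ch) in ("Mc", "Mn") and prev.isspace():
            print(f"line {lineno}, col {col}: dangling {unicodedata.name(ch)}")
        prev = ch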

The PDF uses the following fonts:

name                type          emb sub uni
------------------- ------------- --- --- ---
ZDPKEY+Mangal       CID TrueType  yes yes yes
Mangal              TrueType      no  no  no
Helvetica-Bold      Type 1        no  no  no
CODUBM+Mangal-Bold  CID TrueType  yes yes yes
Mangal-Bold         TrueType      no  no  no
Times-Roman         Type 1        no  no  no
Helvetica           Type 1        no  no  no

Following is the result of the conversion; the left side is the original PDF, and the right side is the text opened in Notepad:

http://preview.tinyurl.com/qbxud9o

My question is whether the 5% missing/junk characters can be correctly captured in text with open-source packages. I would appreciate your inputs!
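For cross-checking, here is a minimal sketch with pdfminer.six, another open-source extractor (an assumption on my part that it is installed, e.g. via pip install pdfminer.six; whether it handles these CID TrueType fonts any better is exactly what I cannot verify):

from pdfminer.high_level import extract_text

# pdfminer.six parses the embedded fonts' ToUnicode maps itself, so its
# output may differ from pdftotext's on the damaged spans.
# The output filename below is arbitrary.
text = extract_text("pdfname.pdf")

with open("textname-pdfminer.txt", "w", encoding="utf-8") as f:
    f.write(text)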

Upvotes: 3

Views: 2292

Answers (1)

Pavan Pyati

Reputation: 960

Change your command to:

pdftotext -enc "UTF-8" pdfname.pdf textname.txt

It has worked for me, and it should similarly work for you.
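If characters are still dropped after that, a quick Python check (a sketch; textname.txt as in the question) verifies that the output file at least decodes as valid UTF-8, which rules out an encoding-flag problem:

# Strict decoding raises UnicodeDecodeError if the file is not valid
# UTF-8; if it decodes cleanly, the remaining gaps are a font/ToUnicode
# issue rather than an encoding one.
with open("textname.txt", encoding="utf-8", errors="strict") as f:
    f.read()
print("textname.txt decodes cleanly as UTF-8")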

Upvotes: 4
