Extracting PDF Text in Hindi using PDFBox

Question

So I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly. But when I try to extract the Hindi Text, some characters are replaced by circle/squares. I copied the Hindi text snippet directly from the PDF File to a Word document and I get the same squares for some characters.

PDFBox Version: 2.0.7

PDF Version: 1.6(Acrobat 7.x)

Security Details(PDF):

Font Details:

I cannot attach the PDF, but here is a snippet of the PDF File(Adobe Acrobat Reader).

Note: I have drawn the black bar as it contains the address of someone.

Output of text extracted using PDFBox:

पता: कालकाजी, दि ण िद ी, िद ी - 110019

As you can see from the output of PDFBox text extraction above, some of the characters are replaced by circles. The same happens when I manually copy from PDF to a word document.

I have tried tesseract OCR also, but that is giving an even worse output. I would like to know any other options that I can try?

For instance, extracting the data using PDFBox, not as a text but an image?

EDIT:: Also getting the following warnings.

03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari

Extracting PDF Text in Hindi using PDFBox

Answers (1)

Related Questions