Abhinav P
Abhinav P

Reputation: 507

Extracting PDF Text in Hindi using PDFBox

So I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly. But when I try to extract the Hindi Text, some characters are replaced by circle/squares. I copied the Hindi text snippet directly from the PDF File to a Word document and I get the same squares for some characters.

PDFBox Version: 2.0.7

PDF Version: 1.6(Acrobat 7.x)

Security Details(PDF): enter image description here

Font Details:

enter image description here

I cannot attach the PDF, but here is a snippet of the PDF File(Adobe Acrobat Reader).

PDF File Snippet

Note: I have drawn the black bar as it contains the address of someone.

Output of text extracted using PDFBox:

पता: कालकाजी, दि ण िद ी, िद ी - 110019

As you can see from the output of PDFBox text extraction above, some of the characters are replaced by circles. The same happens when I manually copy from PDF to a word document.

I have tried tesseract OCR also, but that is giving an even worse output. I would like to know any other options that I can try?

For instance, extracting the data using PDFBox, not as a text but an image?

EDIT:: Also getting the following warnings.

03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari

Upvotes: 1

Views: 1760

Answers (1)

Ankit Kumar
Ankit Kumar

Reputation: 1

If u want to extract Local Address from the Aadhaar Card pdf in text format, its a waste of time , just convert the full pdf into an image with 1600dpi which is the very high-quality image, and crop the local address as well as the local name with the dob and gender from the high quality of aadhar pdf image. i am also doing this way to make aadhar print software.

and after that remove the whitespace from the local address image and remove the white background and save it into png and use it .

Upvotes: -1

Related Questions