Reputation: 507
I am trying to extract English and Hindi text from a PDF file. The English text is extracted properly, but in the Hindi text some characters are replaced by circles or squares. When I copy the Hindi snippet directly from the PDF file into a Word document, I get the same squares for some characters.
PDFBox Version: 2.0.7
PDF Version: 1.6 (Acrobat 7.x)
Font Details:
I cannot attach the PDF, but here is a snippet of the PDF file as shown in Adobe Acrobat Reader.
Note: I drew the black bar because it covers someone's address.
Output of text extracted using PDFBox:
पता: कालकाजी, दि ण िद ी, िद ी - 110019
As you can see from the PDFBox output above, some of the characters are replaced by circles. The same happens when I manually copy the text from the PDF into a Word document.
I have also tried Tesseract OCR, but that gives even worse output. Are there any other options I can try?
For instance, could I use PDFBox to extract the data as an image rather than as text?
EDIT: I am also getting the following warning:
03:58:38.711 [main] WARN o.a.pdfbox.pdmodel.font.PDType0Font - No Unicode mapping for CID+26 (26) in font Lohit-Devanagari
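The warning suggests that the font has no Unicode mapping for some glyphs, so there is no text for those glyphs to extract. As a rough way to spot affected output automatically, here is a minimal sketch that flags words starting with a Devanagari dependent vowel sign or virama, which normally only happens when the preceding consonant was lost. The class and method names are hypothetical, it uses only the JDK (no PDFBox calls), and it is a heuristic: it will miss glyphs that were dropped from the middle of a word.

```java
import java.util.ArrayList;
import java.util.List;

/** Heuristic check for Devanagari text that lost glyphs during extraction. */
final class DevanagariExtractionCheck {

    // Dependent vowel signs (U+093E..U+094C) and the virama (U+094D) attach
    // to a preceding consonant, so they should never be the first character
    // of a word in correctly extracted text.
    static boolean isDependentSign(int codePoint) {
        return codePoint >= 0x093E && codePoint <= 0x094D;
    }

    /** Returns the whitespace-separated tokens that begin with a dependent sign. */
    static List<String> suspiciousTokens(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && isDependentSign(token.codePointAt(0))) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Passing the extracted page text to `suspiciousTokens` and getting a non-empty list back would be a hint that the page should be handled as an image instead of as text.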
Upvotes: 1
Views: 1760
Reputation: 1
If you want to extract the local address from the Aadhaar card PDF as text, it is a waste of time. Instead, convert the full PDF into an image at 1600 dpi, which gives a very high-quality image, and crop the local address, as well as the local name, date of birth and gender, from that image. I am doing it this way myself to build Aadhaar printing software.
After that, trim the whitespace around the cropped address image, remove the white background, save it as a PNG, and use that.
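The crop and background-removal steps could look roughly like this. It is a minimal sketch that assumes the page has already been rendered to a `BufferedImage` (for example with PDFBox's `PDFRenderer`), and that the crop rectangle and the whiteness threshold are values you pick yourself; the class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;

/** Crops a region from a rendered page and makes its white background transparent. */
final class AddressImageCrop {

    /** Copies the rectangle (x, y, w, h) of the page into a new ARGB image. */
    static BufferedImage crop(BufferedImage page, int x, int y, int w, int h) {
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int row = 0; row < h; row++) {
            for (int col = 0; col < w; col++) {
                out.setRGB(col, row, page.getRGB(x + col, y + row));
            }
        }
        return out;
    }

    /**
     * Sets alpha to 0 for every pixel whose red, green and blue values are
     * all at or above the threshold. The image must have an alpha channel
     * (such as TYPE_INT_ARGB), otherwise the transparency is discarded.
     */
    static void whiteToTransparent(BufferedImage img, int threshold) {
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int argb = img.getRGB(x, y);
                int r = (argb >> 16) & 0xFF;
                int g = (argb >> 8) & 0xFF;
                int b = argb & 0xFF;
                if (r >= threshold && g >= threshold && b >= threshold) {
                    img.setRGB(x, y, argb & 0x00FFFFFF); // keep color, drop alpha
                }
            }
        }
    }
}
```

The result can then be written out with `javax.imageio.ImageIO.write(image, "png", file)`, since PNG keeps the alpha channel.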
Upvotes: -1