user2966560

Reputation: 379

PDFBox text extraction not working properly

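import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // PDFBox 2.x package; in 1.8.x it is org.apache.pdfbox.util.PDFTextStripper
PDDocument document = PDDocument.load(inputStream);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
document.close(); // release the resources held by the document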

Extracted text: http://pastebin.com/BXFfMy0z

Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf

What can I do to extract the correct text from this PDF file?

Upvotes: 1

Views: 3473

Answers (3)

stanlyF

Reputation: 266

For text extraction to work, the original file should contain a mapping from the font's glyph codes to Unicode (typically a ToUnicode map). That mapping is absent here, which is why you get broken text after extraction.
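To illustrate why a missing mapping produces garbage, here is a toy model in plain Java. It is only a sketch: `ToUnicodeToy`, `decode` and the fallback of treating each code as a raw character are assumptions made up for this example, not how PDFBox or any real extractor is implemented.

```java
import java.util.Map;

// Toy model only: the class name, method and fallback rule are hypothetical.
final class ToUnicodeToy {

    // Translates glyph codes to text using the given code-to-Unicode table.
    // Codes without an entry are emitted as raw characters (assumed fallback).
    static String decode(int[] glyphCodes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            String mapped = toUnicode.get(code);
            sb.append(mapped != null ? mapped : String.valueOf((char) code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3};

        // With a mapping, the codes become readable text.
        String withMap = decode(codes, Map.of(1, "P", 2, "D", 3, "F"));
        System.out.println(withMap);

        // Without one, the result is a run of invisible control characters.
        String withoutMap = decode(codes, Map.of());
        System.out.println(withoutMap.length() + " unreadable characters");
    }
}
```

The glyph codes in the page content are the same in both cases; only the table decides whether they come out as readable characters.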

Upvotes: 0

mkl

Reputation: 95918

In addition to @karthik27's answer:

Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction from a given document is possible at all.

Thus, whenever you have a document your own text extraction cannot handle, open it in the Reader and try copying & pasting from it. If that results in garbage, most likely it is not authored properly for text extraction, either by mistake or by design.

In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, just like you got with PDFBox, i.e. garbage. Most likely, therefore, nothing short of OCR will allow text extraction from it.
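If you want to detect this situation in code before deciding to fall back to OCR, a rough heuristic is to measure how much of the extracted string consists of invisible or unmapped characters. The following is a minimal sketch using only the Java standard library; the class name `ExtractionSanityCheck`, the method names and the threshold value are assumptions chosen for illustration, and the check will not catch garbage made of ordinary printable characters.

```java
// Heuristic sketch; names and the threshold are illustrative assumptions.
final class ExtractionSanityCheck {

    // Fraction of code points that are control, format, private-use,
    // unassigned or surrogate characters, or the U+FFFD replacement character.
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isWhitespace(cp)) {
                continue; // line breaks, tabs and spaces are normal in text
            }
            int type = Character.getType(cp);
            if (type == Character.CONTROL
                    || type == Character.FORMAT
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || cp == 0xFFFD) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    // True if the share of suspicious characters exceeds the threshold.
    static boolean looksLikeGarbage(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        String extracted = "ab\u0001\u0002"; // stand-in for stripper.getText(document)
        System.out.println("suspicious ratio: " + suspiciousRatio(extracted));
        System.out.println("garbage? " + looksLikeGarbage(extracted, 0.3));
    }
}
```

You would pass the string returned by the text stripper to `looksLikeGarbage` and tune the threshold on your own documents.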

Upvotes: 1

karthik27

Reputation: 484

I think the problem is the encoding: the text in the PDF is encoded in a different format. If you right-click the document and choose Document Properties, you can find the encoding. I think the links below explain this in more detail:

link1
link2

Upvotes: 0
