Reputation: 379
// Load the PDF, extract its text, then release the document's resources.
PDDocument document = PDDocument.load(inputStream);
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
document.close();
Extracted text: http://pastebin.com/BXFfMy0z
Problem pdf: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf
How can I extract the correct text from this PDF file?
Upvotes: 1
Views: 3473
Reputation: 266
For text extraction to work, the PDF has to contain a mapping from its font glyphs to Unicode (for example a ToUnicode CMap). That mapping is absent in this file, which is why the extracted text comes out broken.
Upvotes: 0
Reputation: 95918
In addition to @karthik27's answer:
Adobe Reader is fairly good at text extraction and can therefore generally be used as an indicator of whether text extraction from a given document is possible at all.
Thus, whenever you have a document your own text extraction cannot handle, open it in Adobe Reader and try copying and pasting from it. If that results in garbage, the document most likely was not authored properly for text extraction, either by mistake or by design.
In the case of your document, copying and pasting from Adobe Reader gives me a semi-random collection of invisible and special characters, i.e. the same kind of garbage you got from PDFBox. Most likely, therefore, nothing short of OCR will allow text extraction from it.
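If you want to automate the "does this look like garbage?" check on whatever text your extractor returns, a rough heuristic is to count invisible or unmappable characters. The following is only a minimal sketch: the class name `ExtractionCheck`, its methods, and the threshold are all hypothetical choices for illustration, not part of PDFBox or Adobe Reader, and the set of "suspicious" character types is an assumption that may need tuning for your documents.

```java
// Hypothetical helper, not part of PDFBox: flags extracted text that consists
// largely of invisible or unmappable characters.
final class ExtractionCheck {

    // Returns the fraction of code points that are control, format,
    // private-use, unassigned or surrogate characters, or U+FFFD.
    // Ordinary line breaks and tabs are not counted as suspicious.
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (cp == '\n' || cp == '\r' || cp == '\t') {
                continue;
            }
            int type = Character.getType(cp);
            if (type == Character.CONTROL
                    || type == Character.FORMAT
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE
                    || cp == 0xFFFD) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    // The threshold is an arbitrary assumption; pick one that suits your data.
    static boolean looksLikeGarbage(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

You could then call `ExtractionCheck.looksLikeGarbage(text, 0.3)` on the string returned by your extraction code and fall back to OCR when it returns true. Note that this will not catch garbage made up of ordinary printable characters, so treat it as a first filter only.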
Upvotes: 1