Reputation: 253
I'm trying to extract text from a PDF using Apache PDFBox 1.8.4. My code is below:
public static void main(String[] args) throws Exception {
    PDDocument pdfDocument = PDDocument.load(new File("rep.pdf"));
    PDFTextStripper stripper = new PDFTextStripper();
    String s = stripper.getText(pdfDocument);
    System.out.println(s);
    pdfDocument.close();
}
The PDF I want to convert: https://www.dropbox.com/s/t35rr23v4383yvt/Form-V-report.pdf?dl=0
But I get characters like these:
!"#$%&'()*$+,)!'-,./+/
0+12)3$#'(,,)451#+('1)65+7(,+'(/
!"#$%&'(
)*+,-.##(',/$.0
123.4.5,67,,89:;+
<3$'(=,>:++?,*99%@AB)
Any solutions?
Thanks in advance.
Upvotes: 2
Views: 786
Reputation: 820
Adobe has integrated PDF obfuscation, which the creator of a PDF can enable. I can't recall exactly how it works, but you will see the same problem if you use any of the online PDF text-extraction tools, or if you try to copy and paste the text.
You likely need to either:
A) Ask for a copy without this enabled
or
B) Reverse engineer how it is done, and use that knowledge to undo it.
I have a feeling A is the right answer.
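If you do go with B, here is a minimal sketch of one possible approach. It assumes (and this is only an assumption, I have not verified it against this PDF) that the garbling is a fixed one-to-one character substitution, so that a short sample of garbled output can be lined up with the same text read off the rendered page. The class and method names (GlyphSubstitutionDecoder, buildMapping, decode) are hypothetical and not part of PDFBox. If the PDF uses several fonts, each one may need its own mapping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: undoes a fixed one-to-one
// character substitution learned from a known sample.
class GlyphSubstitutionDecoder {

    // Pair each garbled character with the known plain character at the same position.
    static Map<Character, Character> buildMapping(String garbled, String known) {
        if (garbled.length() != known.length()) {
            throw new IllegalArgumentException("samples must have equal length");
        }
        Map<Character, Character> mapping = new HashMap<>();
        for (int i = 0; i < garbled.length(); i++) {
            char from = garbled.charAt(i);
            char to = known.charAt(i);
            Character previous = mapping.put(from, to);
            // The same garbled character must always stand for the same plain character.
            if (previous != null && previous != to) {
                throw new IllegalStateException("inconsistent mapping for '" + from + "'");
            }
        }
        return mapping;
    }

    // Replace every mapped character; leave unmapped ones (spaces, newlines) unchanged.
    static String decode(String garbled, Map<Character, Character> mapping) {
        StringBuilder out = new StringBuilder(garbled.length());
        for (int i = 0; i < garbled.length(); i++) {
            char c = garbled.charAt(i);
            Character mapped = mapping.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample: suppose the page shows "HELLO" but extraction yields !"##$
        Map<Character, Character> mapping = buildMapping("!\"##$", "HELLO");
        System.out.println(decode("$#\"!", mapping));
    }
}
```

The sample strings in main are invented for illustration. With a real document you would feed in a line of the stripper's output together with the matching line typed in by hand, then run decode over the rest of the output.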
Upvotes: 3