Hrant Vardanyan

Reputation: 253

Extracting text with Apache PDFBox returns wrong characters?

I'm trying to extract text from a PDF using Apache PDFBox 1.8.4. My code is below:

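import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // package used by PDFBox 1.8.x

public static void main(String[] args) throws Exception {
    PDDocument pdfDocument = PDDocument.load(new File("rep.pdf"));
    try {
        // Extract the text of every page and print it
        PDFTextStripper stripper = new PDFTextStripper();
        String s = stripper.getText(pdfDocument);
        System.out.println(s);
    } finally {
        pdfDocument.close(); // release the file even if extraction fails
    }
}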

The PDF I want to convert: https://www.dropbox.com/s/t35rr23v4383yvt/Form-V-report.pdf?dl=0

But I get characters like these:

!"#$%&'()*$+,)!'-,./+/
0+12)3$#'(,,)451#+('1)65+7(,+'(/
!"#$%&'(
)*+,-.##(',/$.0
123.4.5,67,,89:;+
<3$'(=,>:++?,*99%@AB)

Any solutions?

Thanks in advance.

Upvotes: 2

Views: 786

Answers (1)

Will

Reputation: 820

Adobe's tooling includes a PDF obfuscation option that the creator of a PDF can enable. I can't recall exactly how it works, but you will see the same problem if you use any of the online PDF text-extraction tools, or if you try to copy and paste the text from a viewer.

You likely need to either:

A) Ask for a copy without this enabled

or

B) Reverse engineer how it is done, and use that knowledge to undo it.

I have a feeling A is the right answer.
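
If you do try option B, here is a rough sketch of the simplest case, assuming the garbling is a plain one-to-one character substitution within a single font. That is only an assumption; I have not checked it against the PDF in the question. The class name, method names and the sample pair ("HELLO") are all made up for illustration and are not part of PDFBox. The idea is to take a piece of text you can read on the page, line it up with what the extractor returned for it, learn the substitution table, and apply it to the rest of the output.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox.
class GlyphSubstitutionSketch {

    // Learn a garbled -> real character table from one aligned sample pair.
    static Map<Character, Character> learnMapping(String garbled, String actual) {
        if (garbled.length() != actual.length()) {
            throw new IllegalArgumentException("samples must align character by character");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < garbled.length(); i++) {
            char from = garbled.charAt(i);
            char to = actual.charAt(i);
            Character previous = table.put(from, to);
            // The same garbled character must always stand for the same real one.
            if (previous != null && previous != to) {
                throw new IllegalStateException("inconsistent mapping for '" + from + "'");
            }
        }
        return table;
    }

    // Apply the table; characters without an entry pass through unchanged.
    static String decode(String garbled, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            out.append(table.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Invented sample: pretend the page shows "HELLO" and extraction gave !"##$
        Map<Character, Character> table = learnMapping("!\"##$", "HELLO");
        System.out.println(decode("!$\"", table)); // prints HOE
    }
}
```

If the document uses several fonts, each one would probably need its own table, and the approach breaks down completely if the substitution is not one-to-one, which is another reason A is less work.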

Upvotes: 3
