Vipul

Reputation: 1

Mispositioned textboxes when reading doc and pdf files using Apache POI and Apache PDFBox

I am trying to read and process .doc, .docx and .pdf files in Java by converting each one into a single string, using the Apache POI (for doc and docx) and Apache PDFBox (for pdf) libraries.
This works fine until the file contains textboxes. If the layout is like this:

paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3

Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2

The textboxes seem to be appended at the end instead of where they belong, i.e. between the paragraphs. This happens with both doc and pdf files, so both libraries, POI and PDFBox, show the same problem.

The code for reading the pdf file is:


    void pdf(String file) throws IOException {
        //Initialise file
        File myFile = new File(file);
        PDDocument pdDoc = null;
        try {
            //Load PDF
            pdDoc = PDDocument.load(myFile);
            //Create extractor
            PDFTextStripper pdf = new PDFTextStripper();
            //Extract text
            output = pdf.getText(pdDoc);
        }
        finally {
            if(pdDoc != null)
                //Close document
                pdDoc.close();
        }
    }

And the code for the doc file is:


    void doc(String file) throws IOException {
        //initialise file
        File myFile = new File(file);
        //create file input stream; try-with-resources closes it even on failure
        try (FileInputStream fis = new FileInputStream(myFile)) {
            //open document
            HWPFDocument document = new HWPFDocument(fis);
            //create extractor
            WordExtractor extractor = new WordExtractor(document);
            //get text from document
            output = extractor.getText();
        }
    }

Upvotes: 0

Views: 2040

Answers (2)

Diptman

Reputation: 404

Try the code below for pdf; the key change is the call to setSortByPosition(true). You can try a similar approach for doc as well.

    void extractPdfTexts(String file) {
        File myFile = new File(file);
        String output;
        try (PDDocument pdDocument = PDDocument.load(myFile)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            pdfTextStripper.setSortByPosition(true);
            output = pdfTextStripper.getText(pdDocument);
            System.out.println(output);
        } catch (InvalidPasswordException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Upvotes: 0

impeto

Reputation: 350

For PDFBox, call setSortByPosition(true) on the PDFTextStripper before extracting the text: pdf.setSortByPosition(true);
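
To see why sorting by position helps, here is a rough, library-free sketch of the idea. It is not PDFBox's actual implementation: the Chunk class, the inReadingOrder method and the coordinates are all made up for illustration. The assumption is that the file stores the textbox text after the paragraph text, which matches the output in the question, and that each piece of text comes with a position on the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical piece of text with the position of its top-left corner. */
    static final class Chunk {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunks top to bottom, then left to right, ignoring the stored order. */
    static String inReadingOrder(List<Chunk> storedOrder) {
        List<Chunk> sorted = new ArrayList<>(storedOrder);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stored order (assumed): all paragraphs first, textboxes last.
        List<Chunk> stored = new ArrayList<>();
        stored.add(new Chunk("paragraph 1", 50, 100));
        stored.add(new Chunk("paragraph 2", 50, 300));
        stored.add(new Chunk("paragraph 3", 50, 500));
        stored.add(new Chunk("textbox 1", 50, 200));
        stored.add(new Chunk("textbox 2", 50, 400));

        // prints "paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3"
        System.out.println(inReadingOrder(stored));
    }
}
```

Without the sort, joining the chunks in stored order gives the "paragraphs first, textboxes last" output from the question. A plain top-to-bottom sort like this can interleave text from side-by-side columns, so it may not suit every layout.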

Upvotes: 3
