Reputation: 1
I am trying to read and process .doc, .docx, and .pdf files in Java by converting each one into a single string, using Apache POI (for .doc and .docx) and Apache PDFBox (for .pdf).
It works fine until it encounters textboxes.
If the format is like this:
paragraph 1
textbox 1
paragraph 2
textbox 2
paragraph 3
Then the output should be:
paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3
But the output I am getting is:
paragraph 1 paragraph 2 paragraph 3 textbox 1 textbox 2
It seems to append the textboxes at the end instead of placing them where they belong, i.e. between the paragraphs. This happens with both .doc and .pdf files, so both libraries, POI and PDFBox, show the same problem.
The code for reading a PDF file is:
void pdf(String file) throws IOException {
    // Initialise file
    File myFile = new File(file);
    PDDocument pdDoc = null;
    try {
        // Load PDF
        pdDoc = PDDocument.load(myFile);
        // Create extractor
        PDFTextStripper pdf = new PDFTextStripper();
        // Extract text
        output = pdf.getText(pdDoc);
    } finally {
        if (pdDoc != null)
            // Close document
            pdDoc.close();
    }
}
And the code for a .doc file is:
void doc(String file) throws FileNotFoundException, IOException {
    File myFile = null;
    WordExtractor extractor = null;
    // Initialise file
    myFile = new File(file);
    // Create file input stream
    FileInputStream fis = new FileInputStream(myFile.getAbsolutePath());
    // Open document
    HWPFDocument document = new HWPFDocument(fis);
    // Create extractor
    extractor = new WordExtractor(document);
    // Get text from document
    output = extractor.getText();
}
Upvotes: 0
Views: 2040
Reputation: 404
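Try the code below for PDF. The relevant change is the call to setSortByPosition(true), which asks PDFTextStripper to order the text by its position on the page rather than by the order in which it is stored in the file. You can try a similar approach for .doc as well.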
void extractPdfTexts(String file) {
    File myFile = new File(file);
    String output;
    try (PDDocument pdDocument = PDDocument.load(myFile)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        pdfTextStripper.setSortByPosition(true);
        output = pdfTextStripper.getText(pdDocument);
        System.out.println(output);
    } catch (InvalidPasswordException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
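To see why sorting by position could help, here is a small library-free sketch of the idea. It is only an illustration and not PDFBox's actual implementation: the Fragment class, the joinByPosition method, and the coordinates are all made up for this example. It assumes each piece of text carries the position where it is drawn, and that the file stores the paragraphs first and the textboxes last, as in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.StringJoiner;

// Simplified illustration only: this is NOT how PDFBox is implemented.
class PositionSortSketch {

    // Hypothetical type: a piece of text plus the point where it is drawn.
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Orders fragments top-to-bottom, then left-to-right, and joins them with spaces.
    static String joinByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringJoiner joiner = new StringJoiner(" ");
        for (Fragment f : sorted) {
            joiner.add(f.text);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Stored order mimics the question: paragraphs first, textboxes last.
        List<Fragment> stored = Arrays.asList(
                new Fragment("paragraph 1", 50, 100),
                new Fragment("paragraph 2", 50, 300),
                new Fragment("paragraph 3", 50, 500),
                new Fragment("textbox 1", 50, 200),
                new Fragment("textbox 2", 50, 400));
        // prints "paragraph 1 textbox 1 paragraph 2 textbox 2 paragraph 3"
        System.out.println(joinByPosition(stored));
    }
}
```

Reading the fragments in stored order gives the output from the question, with the textboxes at the end; sorting by vertical position first restores the visual order. Real pages with multiple columns or rotated text would need more than this simple top-to-bottom sort.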
Upvotes: 0