Reputation: 1405
Hi, I am trying to read text from .doc and .docx files. For .doc files I am doing this:
package test;
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadFile {
    public static void main(String[] args) {
        File file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(file)) {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // report the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}
But this gives me an org/apache/poi/OldFileFormatException exception.
Any idea how to fix this?
I also need to read .docx and PDF files. Is there a good way to read all of these file types?
Upvotes: 3
Views: 15324
Reputation: 3514
I do not know why you are using WordExtractor just to get text from .doc. For me it was enough to use one method:
import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
String text = doc.getDocumentText();
System.out.println(text);
...
To work with .pdf files, use another Apache library: PDFBox.
Upvotes: 0
Reputation: 3806
Using the following jars (in case version numbers play a role here):
dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0
I typed this up:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";
        // try-with-resources closes the stream once the text has been read
        try (FileInputStream fis = new FileInputStream(new File(filePath))) {
            // compare the whole extension, not just the last character
            if (filePath.toLowerCase().endsWith(".docx")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } else { // is not a docx
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This allowed me to read text from both a .docx and a .doc file. If it doesn't work on your PC, you may well have an issue with the external jars you are using.
Give it a go though :) Good luck!
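If you want to add more file types later, the extension check can be pulled out into its own helper. The following is only a rough sketch using plain JDK classes; the names ExtensionDispatch, extensionOf and readerFor are made up for illustration, and the returned strings are just labels for the library you would call at that point.

```java
import java.util.Locale;

// Hypothetical helper (not part of POI): picks a reader by file extension.
final class ExtensionDispatch {

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String path) {
        int sep = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        int dot = path.lastIndexOf('.');
        if (dot <= sep) {
            return ""; // no dot at all, or the last dot belongs to a directory name
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Returns a label naming the classes that would handle this file. */
    static String readerFor(String path) {
        switch (extensionOf(path)) {
            case "doc":
                return "HWPFDocument + WordExtractor";
            case "docx":
                return "XWPFDocument + XWPFWordExtractor";
            case "pdf":
                return "PDFBox";
            default:
                return "unsupported";
        }
    }

    public static void main(String[] args) {
        System.out.println(readerFor("D:\\Users\\username\\Desktop\\Bob.doc"));
        System.out.println(readerFor("D:\\Users\\username\\Desktop\\Doc1.docx"));
    }
}
```

Matching the full extension also avoids treating other files whose name ends in "x" (an .xlsx, for example) as a .docx.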
Upvotes: 7
Reputation: 45060
If you look at the Javadoc of OldFileFormatException, you can see the reason for that:
Base class of all the exceptions that POI throws in the event that it's given a file that's older than currently supported.
This means that the r.doc you're using is in a format that HWPFDocument does not support. HWPFDocument reads the Word 97-2003 binary .doc format, so the file was probably written by an older version of Word. The newer .docx format, which has also been around for quite a long time now, is handled by XWPFDocument rather than HWPFDocument.
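Since the extension can lie about the actual format, one way to check what you really have is to look at the first few bytes of the file before handing it to a parser. Here is a minimal sketch with plain JDK classes; FormatSniffer, Kind, classify and sniff are hypothetical names for this example, not anything from POI. It relies on the well-known signatures of the OLE2 container (used by binary .doc files), ZIP (the packaging used by .docx) and PDF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of POI): guesses the container format from magic bytes.
final class FormatSniffer {
    enum Kind { OLE2, ZIP, PDF, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};
    private static final byte[] PDF_MAGIC = {'%', 'P', 'D', 'F'};

    /** Classifies a file from its leading bytes (8 bytes are enough). */
    static Kind classify(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Kind.ZIP;
        if (startsWith(head, PDF_MAGIC)) return Kind.PDF;
        return Kind.UNKNOWN;
    }

    /** Reads up to 8 bytes from the start of the file and classifies them. */
    static Kind sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break; // file is shorter than 8 bytes
                n += r;
            }
        }
        byte[] actual = new byte[n];
        System.arraycopy(head, 0, actual, 0, n);
        return classify(actual);
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

This only identifies the container. A ZIP result does not prove the file is a .docx, and an OLE2 result does not tell you which Word version wrote it, so HWPFDocument can still throw for a file that passes this check.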
Upvotes: 1