Reputation: 378
I am trying to read a Microsoft Word 2016 document, but it fails with an exception. Here is my code:
private String readDoc(String path) {
    String content = "";
    try {
        File file = new File(path);
        FileInputStream fis = new FileInputStream(file.getAbsolutePath());
        HWPFDocument doc = new HWPFDocument(fis);
        WordExtractor we = new WordExtractor(doc);
        String[] paragraphs = we.getParagraphText();
        for (String para : paragraphs) {
            content += para.toString();
        }
        fis.close();
        return content;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return content;
}
Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
I don't understand why I get this exception, since I am not using XSSF anywhere (as far as I know).
Upvotes: 0
Views: 1373
Reputation: 188
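The exception says your file is in the Office 2007+ XML format (.docx), but HWPFDocument only handles the older binary .doc format. Use the XWPF classes instead: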
FileInputStream fis = new FileInputStream("test.docx");
XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
System.out.println(extractor.getText());
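It helps to know which file format each POI component handles:
POIFS (Poor Obfuscation Implementation File System) − This component implements the OLE2 compound document format, the container that the older binary Office formats (.doc, .xls, .ppt) are built on. The H* components below rely on it.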
HSSF (Horrible SpreadSheet Format) − It is used to read and write .xls format of MS-Excel files.
XSSF (XML SpreadSheet Format) − It is used for .xlsx file format of MS-Excel.
HPSF (Horrible Property Set Format) − It is used to extract property sets of the MS-Office files.
HWPF (Horrible Word Processor Format) − It is used to read and write .doc extension files of MS-Word.
XWPF (XML Word Processor Format) − It is used to read and write .docx extension files of MS-Word.
HSLF (Horrible Slide Layout Format) − It is used to read, create, and edit PowerPoint presentations in the binary .ppt format.
HDGF (Horrible DiaGram Format) − It contains classes and methods for MS-Visio binary files.
HPBF (Horrible PuBlisher Format) − It is used to read MS-Publisher files, mainly for text extraction.
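If you are not sure which family a file belongs to, you can look at its first bytes before picking a POI class. As far as I know, the binary formats (.doc, .xls, .ppt) are OLE2 files that start with D0 CF 11 E0 A1 B1 1A E1, while the Office 2007+ formats are ZIP packages that start with "PK" (50 4B 03 04). Below is a rough sketch using only the JDK. The class name WordFormatSniffer and the Kind enum are made up for this example and are not part of POI; POI ships its own FileMagic class for this kind of check, if I remember correctly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Apache POI): guesses the container
// format of an Office file from its leading bytes.
public class WordFormatSniffer {

    public enum Kind { OLE2, OOXML, UNKNOWN }

    // Signature of an OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header; .docx/.xlsx/.pptx are ZIP packages.
    // Note: any ZIP file matches, so this is only a hint, not proof of OOXML.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a header that was already read into memory.
    public static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;   // use HWPF / HSSF / HSLF
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.OOXML;  // use XWPF / XSSF / XSLF
        }
        return Kind.UNKNOWN;
    }

    // Reads up to 8 bytes from the file and classifies them.
    public static Kind sniff(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = in.readNBytes(header, 0, header.length); // Java 9+
            return sniff(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

If the result is OOXML, go with XWPFDocument as shown above; if it is OLE2, your original HWPFDocument code should be the right entry point.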
Upvotes: 2