AzmahQi

Reputation: 378

Exception org.apache.poi.poifs.filesystem.OfficeXmlFileException - apache.Poi 4.0.0

I am trying to read a Microsoft Word 2016 document but I can't...

private String readDoc(String path) {
    String content = "";
    try {
        File file = new File(path);
        FileInputStream fis = new FileInputStream(file.getAbsolutePath());

        HWPFDocument doc = new HWPFDocument(fis);

        WordExtractor we = new WordExtractor(doc);
        String[] paragraphs = we.getParagraphText();
        for (String para : paragraphs) {
            content += para.toString();
        }
        fis.close();
        return content;
    } catch (Exception e) {
        e.printStackTrace();
    }
    return content;
}

Exception in thread "main" org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)

I don't understand why I get this exception, since I am not using XSSF anywhere (I think).

Upvotes: 0

Views: 1373

Answers (1)

SURU

Reputation: 188

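HWPFDocument only reads the old binary .doc format. Word 2016 saves documents as .docx (Office Open XML) by default, so use the XWPF classes instead. The "XSSF instead of HSSF" in the message is just an example of the same kind of mismatch for Excel files. Try this: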

// .docx files are OOXML, so read them with XWPF (poi-ooxml), not HWPF.
try (FileInputStream fis = new FileInputStream("test.docx");
     XWPFDocument xdoc = new XWPFDocument(fis)) {
    XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
    System.out.println(extractor.getText());
}

This overview of the POI components may help explain the error:

POIFS (Poor Obfuscation Implementation File System) − The base component that the other binary-format components build on. It reads and writes the OLE2 container used by the pre-2007 Office file formats.

HSSF (Horrible SpreadSheet Format) − Reads and writes the .xls format of MS-Excel.

XSSF (XML SpreadSheet Format) − Reads and writes the .xlsx format of MS-Excel.

HPSF (Horrible Property Set Format) − Extracts the property sets (metadata such as title and author) of MS-Office files.

HWPF (Horrible Word Processor Format) − Reads and writes .doc files of MS-Word.

XWPF (XML Word Processor Format) − Reads and writes .docx files of MS-Word.

HSLF (Horrible Slide Layout Format) − Reads, creates, and edits PowerPoint presentations in the binary .ppt format.

HDGF (Horrible DiaGram Format) − Contains classes and methods for reading MS-Visio binary files.

HPBF (Horrible PuBlisher Format) − Provides limited, read-only support for MS-Publisher files.
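
If you don't know in advance whether a file is .doc or .docx, you can check its first bytes before choosing between HWPF and XWPF. Below is a minimal sketch using only the JDK (Java 9+ for readNBytes). The class name WordFormatSniffer, the Format enum, and the detect method are made up for this illustration and are not part of POI. Note that the ZIP signature only tells you the file is a ZIP container, which is what .docx uses, so a plain .zip would match as well.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Apache POI.
public class WordFormatSniffer {

    public enum Format { OLE2, OOXML, UNKNOWN }

    // Signature of the OLE2 container used by legacy .doc/.xls/.ppt files.
    private static final int[] OLE2_MAGIC =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // ZIP local file header signature; .docx/.xlsx/.pptx are ZIP archives.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Reads up to 8 bytes from the stream and classifies the container type.
    // The stream is consumed, so reopen the file before handing it to POI.
    public static Format detect(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = in.readNBytes(header, 0, header.length);
        if (startsWith(header, read, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, read, ZIP_MAGIC)) return Format.OOXML;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for the start of a .docx file (ZIP header).
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(detect(new ByteArrayInputStream(zipStart)));
    }
}
```

With a real file you would call detect on a FileInputStream, then open the file again and pass it to HWPFDocument for OLE2 or XWPFDocument for OOXML. POI also ships its own detection in org.apache.poi.poifs.filesystem.FileMagic, which is probably the better choice if POI is already on your classpath.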

Upvotes: 2
