Reputation: 21329
While trying to read a .docx file, I get the following exception:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied data
appears to be in the Office 2007+ XML. You are calling the part of POI
that deals with OLE2 Office Documents. You need to call a different part of POI
to process this data (eg XSSF instead of HSSF)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:131)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:104)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:128)
at org.apache.poi.hwpf.HWPFDocumentCore.verifyAndBuildPOIFS(HWPFDocumentCore.java:106)
at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:53)
at org.suhail.gui.Main.parseDocxFile(Main.java:245)
at org.suhail.gui.Main.jButton1ActionPerformed(Main.java:166)
at org.suhail.gui.Main.access$000(Main.java:21)
at org.suhail.gui.Main$1.actionPerformed(Main.java:70)
at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
at javax.swing.plaf.basic.BasicButtonListener.mouseReleased(BasicButtonListener.java:236)
at java.awt.Component.processMouseEvent(Component.java:6038)
at javax.swing.JComponent.processMouseEvent(JComponent.java:3260)
at java.awt.Component.processEvent(Component.java:5803)
at java.awt.Container.processEvent(Container.java:2058)
at java.awt.Component.dispatchEventImpl(Component.java:4410)
at java.awt.Container.dispatchEventImpl(Container.java:2116)
at java.awt.Component.dispatchEvent(Component.java:4240)
at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4322)
at java.awt.LightweightDispatcher.processMouseEvent(Container.java:3986)
at java.awt.LightweightDispatcher.dispatchEvent(Container.java:3916)
at java.awt.Container.dispatchEventImpl(Container.java:2102)
at java.awt.Window.dispatchEventImpl(Window.java:2429)
at java.awt.Component.dispatchEvent(Component.java:4240)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:273)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:183)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:173)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:168)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:160)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:121)
I have a rough idea of what could be causing the exception, but I don't understand the reason exactly.
The .docx file was saved from MS Word 2007.
Snippet that parses the file:
public void parseDocxFile(String textEntered) {
    try {
        WordExtractor extractor = new WordExtractor(new FileInputStream(new File(SContainer.getFilePath())));
        System.out.println(".DOCX File : " + extractor.getText());
    } catch (Exception exc) {
        exc.printStackTrace();
    }
}
Note: I am using the latest version of POI, 3.10.
Upvotes: 0
Views: 3357
Reputation: 3364
According to the Apache POI docs, for .docx files you have to use the XWPF API, not the HWPF API.
HWPF is the name of our port of the Microsoft Word 97(-2007) file format to pure Java. It also provides limited read only support for the older Word 6 and Word 95 file formats. The partner to HWPF for the new Word 2007 .docx format is XWPF
So use the XWPF API to read .docx files.
For basic text extraction, make use of org.apache.poi.xwpf.extractor.XWPFWordExtractor. It accepts an input stream or a XWPFDocument. The getText() method can be used to get the text from all the paragraphs, along with tables, headers etc.
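As a rough sketch of what that could look like for the snippet in the question: the class name DocxTextReader and the method readDocx are made up for illustration, and the code assumes the poi-ooxml jar and its dependencies are on the classpath. Only the input stream is closed explicitly here, because XWPFDocument reads the whole stream when it is constructed.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Hypothetical helper class, not part of POI.
public class DocxTextReader {

    // Reads the text of a .docx file through the XWPF (OOXML) API
    // instead of the HWPF WordExtractor used in the question.
    public static String readDocx(String path) throws IOException {
        try (InputStream in = new FileInputStream(path)) {
            XWPFDocument document = new XWPFDocument(in);
            XWPFWordExtractor extractor = new XWPFWordExtractor(document);
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to a .docx file.
        System.out.println(".DOCX File : " + readDocx(args[0]));
    }
}
```

In the question's parseDocxFile method, this would mean replacing the WordExtractor line with a call such as readDocx(SContainer.getFilePath()).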
UPDATED: To address your question.
What is the difference between the two ?
In my opinion: .doc is a binary, obfuscated document format, and reading it required third-party support, which is what HWPF builds on. From 2007 onward, Microsoft uses OOXML (Office Open XML), whose specification is publicly available, so implementing an API to read this format became easier. Apache therefore implemented another set of APIs to read OOXML files (.docx).
HWPF and XWPF do not share any common interfaces, methods, or code; the two are independent.
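Since the two APIs are independent, code that has to handle both kinds of file needs to decide up front which one to call. One way to do that without POI is to look at the first bytes of the file: OLE2 files (.doc) start with the signature D0 CF 11 E0 A1 B1 1A E1, while OOXML files (.docx) are ZIP packages and start with "PK" followed by 03 04. The sketch below uses only the JDK; the class WordFormatSniffer and its enum are made-up names, and a ZIP signature only shows that the file is a ZIP archive, not that it really is a Word document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of POI: guesses the container format
// from the leading bytes of a stream.
public class WordFormatSniffer {

    public enum Format { OLE2, ZIP_BASED, UNKNOWN }

    // Signature of an OLE2 compound file (.doc, .xls, .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (.docx, .xlsx, .pptx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    public static Format sniff(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        if (startsWith(header, read, OLE2_MAGIC)) {
            return Format.OLE2;       // use HWPF (WordExtractor)
        }
        if (startsWith(header, read, ZIP_MAGIC)) {
            return Format.ZIP_BASED;  // most likely OOXML: use XWPF
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(new ByteArrayInputStream(zipStart)));
    }
}
```

The stream is consumed by the check, so the file has to be reopened (or the stream wrapped so it can be rewound) before handing it to HWPF or XWPF.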
I also found this link, which provides samples using both frameworks. It may be useful.
Upvotes: 4