Reputation: 11
I am facing an issue when using apache poi to extract an embedded .xlsx files from a .ppt file. It would be really great if somebody could help me out.
The subject of the problem:
Problem trying to solve: Extracting a ".xlsx" file embedded inside a ".ppt".
I am currently using apache-poi.
It seems that when I try to do it using hslfSlideShow.getEmbeddedObjects(), I get the xlsx object just fine but when I try converting it to the XLSFWorkbook object using say WorkbookFactory.create(inputStream), it threw an error saying
java.lang.IllegalArgumentException: The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry. Is it really an excel file? Had: [OlePres000, Ole, CompObj, Package]
at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:286)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:326)
at org.apache.poi.hssf.usermodel.HSSFWorkbookFactory.createWorkbook(HSSFWorkbookFactory.java:64)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:167)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:112)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:253)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:221)
Interestingly it is calling HSSFWorkbookFactory even though its an xlsx file.
And no the xlsx file is not corrupted/password-protected. I can open it just fine.
Also, it works fine if I try parsing the .xlsx file without embedding it in the .ppt.
And the parsing works fine when I embed it in a .pptx file and call methods such as xmlSlideShow.getAllEmbeddedParts() to get the embedded objects from .pptx.
Upvotes: 0
Views: 464
Reputation: 48326
Promoting some comments and investigation to an answer...
This was a limitation in older version of Apache POI, but was fixed in July in r1880164.
For backwards-compatibility reasons, PowerPoint will often (but not always...) write embedded OOXML resources wrapped in an intermediate OLE2 layer. This has the advantage that tools/programs which expect embedded office documents to be something like a xls
/ doc
to cope, but at the expense of another layer of wrapping.
Newer versions of Apache POI (5.0 should be the first released one with the fix in) have support in WorkbookFactory
for receiving an OLE2 wrapper like this, pulling out the underlying xlsx
stream and handing that off to XSSFWorkbook
. (Older versions did this for OLE2-based password-protected xlsx
files, but not their unencrypted cousins)
For now, if you're stuck on an affected POI version, the code you'll want is something like this (largely taken from the unit test verifying support!):
POIFSFileSystem fs = new POIFSFileSystem(data.getInputStream());
if(fs.getRoot().hasEntry("Package")) {
DocumentInputStream dis = new DocumentInputStream((DocumentEntry)fs.getRoot().getEntry("Package"));
try (OPCPackage pkg = OPCPackage.open(dis)) {
XSSFWorkbook wb = new XSSFWorkbook(pkg);
handleWorkbook(wb);
wb.close();
}
} else {
try (HSSFWorkbook wb = new HSSFWorkbook(fs)) {
handleWorkbook(wb);
}
}
Upvotes: 1