Reputation: 324
I want to parse a .doc file with Tika but it does not work.
The error that I get is:
Caused by: org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)
What exactly do I need to change?
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaDocx {
    public static void main(final String[] args) throws IOException, TikaException, SAXException {
        //detecting the file type
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(new File("C:\\test.doc"));
        ParseContext pcontext = new ParseContext();

        //OOXml parser
        OOXMLParser msofficeparser = new OOXMLParser();
        msofficeparser.parse(inputstream, handler, metadata, pcontext);
        System.out.println("Contents of the document:" + handler.toString());
        System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();

        for (String name : metadataNames) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
Upvotes: 0
Views: 3785
Reputation: 48346
You should only call an explicit Apache Tika parser, e.g. OOXMLParser, if you already know what the file is and what the best parser for that file type is.
The error you are getting is telling you that you are passing an OLE2-based .doc file to the Apache Tika parser that handles OOXML files such as .docx.
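To make the OLE2 vs OOXML distinction concrete, here is a minimal sketch using only the JDK that looks at the first bytes of a file. It relies on two well-known signatures: OLE2 compound documents (legacy .doc, .xls, .ppt) start with D0 CF 11 E0 A1 B1 1A E1, while OOXML files are ZIP packages and start with PK\x03\x04. The class name OfficeContainerSniffer and its methods are made up for illustration and are not part of Tika or POI. Tika's own detection is far more thorough, so treat this as an explanation of the error, not a replacement for auto-detection.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not a Tika or POI class): classifies a file by its leading bytes.
class OfficeContainerSniffer {
    // Signature of an OLE2 compound document (legacy .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header; OOXML files (.docx, .xlsx, .pptx) are ZIP packages.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns "OLE2", "ZIP" (possibly OOXML, but any ZIP matches) or "UNKNOWN".
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    // Reads up to 8 bytes from the start of the file and classifies them.
    static String sniffFile(String path) throws IOException {
        byte[] buffer = new byte[8];
        int total = 0;
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) != -1) {
                total += n;
            }
        }
        byte[] header = new byte[total];
        System.arraycopy(buffer, 0, header, 0, total);
        return sniff(header);
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(args[0] + ": " + sniffFile(args[0]));
        } else {
            System.out.println(sniff(OLE2_MAGIC)); // prints "OLE2"
            System.out.println(sniff(ZIP_MAGIC));  // prints "ZIP"
        }
    }
}
```

Running this against C:\test.doc should report OLE2, which is exactly the container format the OOXML parser rejects in the exception above.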
If you don't know exactly what your file type is, as seems to be the case here, you should instead let Apache Tika identify the type and pick the best parser for you.
To do that, change your current explicit lines
OOXMLParser msofficeparser = new OOXMLParser ();
msofficeparser.parse(inputstream, handler, metadata,pcontext);
to auto-detecting ones (and import org.apache.tika.parser.AutoDetectParser instead of OOXMLParser):
AutoDetectParser parser = new AutoDetectParser();
parser.parse(inputstream, handler, metadata, pcontext);
Then let Tika do the hard work for you!
Upvotes: 1