jennifer ruurs

Reputation: 324

Tika parsing error: You are calling the part of POI that deals with OOXML. You need to call a different part of POI to process this data

I want to parse a .doc file with Tika but it does not work.

The error that I get is:

Caused by: org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)

What exactly do I need to change?

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.SAXException;

public class TikaDocx {

    public static void main(final String[] args) throws IOException, TikaException, SAXException {

        //detecting the file type
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        FileInputStream inputstream = new FileInputStream(new File("C:\\test.doc"));
        ParseContext pcontext = new ParseContext();

        //OOXml parser
        OOXMLParser  msofficeparser = new OOXMLParser ();
        msofficeparser.parse(inputstream, handler, metadata,pcontext);
        System.out.println("Contents of the document:" + handler.toString());
        System.out.println("Metadata of the document:");
        String[] metadataNames = metadata.names();

        for(String name : metadataNames) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}

Upvotes: 0

Views: 3785

Answers (1)

Gagravarr

Reputation: 48346

You should only call an explicit Apache Tika parser, e.g. OOXMLParser, if you already know what the file type is and which parser is best for that type.

The error you are getting is telling you that you are passing an OLE2-based .doc file to the Apache Tika parser that handles OOXML files such as .docx. The two formats are unrelated containers, so the OOXML parser cannot read it.
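To see the difference between the two formats for yourself, you can look at the first few bytes of the file. The sketch below is only an illustration and is not how Tika or POI implement detection: the class name ContainerSniffer and its sniff methods are made up for this example. It relies on two well-known signatures: OLE2 compound files (.doc, .xls, .ppt) start with the bytes D0 CF 11 E0 A1 B1 1A E1, while OOXML files (.docx, .xlsx, .pptx) are ZIP archives and start with "PK" followed by 03 04.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not part of Tika or POI.
public class ContainerSniffer {

    // First eight bytes of an OLE2 compound file (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a ZIP archive,
    // which is the container used by OOXML (.docx, .xlsx, .pptx).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the leading bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    // Reads up to eight bytes from the stream and classifies them.
    static String sniff(InputStream in) throws IOException {
        byte[] buffer = new byte[OLE2_MAGIC.length];
        int read = in.readNBytes(buffer, 0, buffer.length); // Java 9+
        byte[] head = new byte[read];
        System.arraycopy(buffer, 0, head, 0, read);
        return sniff(head);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContainerSniffer <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(args[0] + " looks like: " + sniff(in));
        }
    }
}
```

A "ZIP" result only says the file is a ZIP archive, not that it is definitely OOXML, and a real .doc should report "OLE2". This is exactly the kind of guesswork you avoid by letting Tika detect the type itself.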

Where you don't know exactly what your file type is, as seems to be the case here, you should instead let Apache Tika identify the type and pick the best parser for you.

To do that, change your current explicit lines

    OOXMLParser  msofficeparser = new OOXMLParser ();
    msofficeparser.parse(inputstream, handler, metadata,pcontext);

To auto-detecting ones (this also means importing org.apache.tika.parser.AutoDetectParser instead of OOXMLParser)

    AutoDetectParser parser = new AutoDetectParser();
    parser.parse(inputstream, handler, metadata, pcontext);

Then let Tika do the hard work for you!

Upvotes: 1
