Rob

Reputation: 330

Using Tika 1.10 Parser to obtain file content

I have an unusual problem when trying to obtain the content of a file using the Tika parser. The following code works fine with several types of input file (e.g. doc, docx, txt, pdf) when run within a JUnit test, i.e. I am able to obtain the text content of each file. When I run the same code within my application, no text is returned. There is no exception, just an empty String from handler.toString().

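import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.InvalidParameterException;

import org.apache.tika.exception.EncryptedDocumentException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static String parseFile(final String path, final int charCountLimit) {

    if (path == null) {
        throw new InvalidParameterException("parameter is null");
    }

    // -1 disables the write limit; any other value must be positive
    if (charCountLimit < -1 || charCountLimit == 0) {
        throw new InvalidParameterException("char count limit is out of range");
    }

    final File file = new File(path);

    if (!file.exists()) {
        throw new InvalidParameterException(String.format("file does not exist %s", path));
    }

    try (InputStream stream = new FileInputStream(file.getAbsolutePath())) {
        final AutoDetectParser parser = new AutoDetectParser();
        final BodyContentHandler handler = new BodyContentHandler(charCountLimit);

        Metadata metadata = new Metadata();
        /* the following setting is required for Office 2007 and later files,
         * despite not being specified in the Tika Parser documentation
         */
        metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        parser.parse(stream, handler, metadata);
        return handler.toString();

    } catch (EncryptedDocumentException e) {
        //handle exception
    } catch (IOException | SAXException | TikaException e) {
        //handle exception
    }
    // reached only after a handled exception; without a return here the method does not compile
    return null;
}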

My first thought was that my application does something to the files I'm using. However, I have ruled this out by hard-coding the path to one of the test case files on my file system.

A further thought was that I had some kind of version conflict. In my project's POM I reference v 1.10 of tika-core, but a parent POM specified v 1.8. I've changed the parent POM's reference to 1.10, yet the problem remains.

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.10</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.10</version>
    </dependency>

I would be grateful for suggestions on how to resolve this problem.

UPDATE

Having worked through http://wiki.apache.org/tika/Troubleshooting%20Tika#No_Content_Extracted I've worked out that all the parsers are missing. In JUnit, the

org.apache.tika.parser.DefaultParser 

contains 58 parsers. When run on my JBoss 8 server, within the application, the DefaultParser contains no parsers. On adding the JVM parameter

-Dorg.apache.tika.service.error.warn=true 

no java.lang.NoClassDefFoundError is reported that would indicate a failure to load the parsers.
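Since the parsers are registered through a service descriptor file shipped in the tika-parsers jar, one way to narrow this down is to ask a class loader which copies of that file it can see. The following is a minimal sketch using only the JDK; the class name ParserDescriptorCheck and the method findResources are made up for illustration, and which class loader Tika actually consults in a given container is an assumption you would need to verify (inside the application, passing AutoDetectParser.class.getClassLoader() is probably the most telling choice).

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika: lists where a class loader can see
// the service descriptor that registers Tika's Parser implementations.
final class ParserDescriptorCheck {

    // Descriptor file shipped inside the tika-parsers jar.
    static final String DESCRIPTOR = "META-INF/services/org.apache.tika.parser.Parser";

    // Returns the URL of every copy of the named resource visible to the loader.
    static List<URL> findResources(ClassLoader loader, String name) throws IOException {
        return Collections.list(loader.getResources(name));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this class's own loader is used here only to keep the sketch
        // JDK-only; in the application, use the loader that loaded tika-core.
        ClassLoader loader = ParserDescriptorCheck.class.getClassLoader();
        List<URL> found = findResources(loader, DESCRIPTOR);
        System.out.println("Parser descriptors visible: " + found.size());
        for (URL url : found) {
            System.out.println("  " + url);
        }
    }
}
```

An empty list from the loader that owns tika-core would suggest that tika-parsers is not visible from that part of the deployment, even if the jar is present elsewhere in it.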

Upvotes: 1

Views: 789

Answers (1)

Rob

Reputation: 330

I fixed my problem. The issue was related to dependencies in the EAR file containing my "parse file" jar.

In my EAR's POM, there was already a dependency reference to tika-core. At runtime, the EAR's copy of tika-core was used to instantiate the AutoDetectParser. Since I had no dependency reference to tika-parsers in the EAR's POM, it was not possible to load the parser classes.

So, it seems the problem was caused by an incorrect Maven dependency configuration. It was made harder to diagnose by the fact that the DefaultParser (obtained by the AutoDetectParser) doesn't by default generate any output, or throw an exception, when it can't find any parsers.
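For illustration, a fix along these lines would probably mean declaring tika-parsers next to tika-core in the EAR module's POM, so that both jars are packaged and loaded together. This is a sketch under that assumption, reusing the coordinates and version from the question; the exact layout of the EAR's POM is not shown in the original.

```xml
<!-- In the EAR module's POM (assumed location): keep both artifacts on the same version -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.10</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.10</version>
</dependency>
```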

Upvotes: 2
