anchovie

Reputation: 115

Why is my Tika Metadata object not being populated when using ForkParser?

ForkParser is a parser introduced in Tika 0.9, located in org.apache.tika.fork. It forks a new JVM process to analyze the passed file stream. I figured this would be a good way to limit how much memory I devote to Tika's metadata extraction. However, the Metadata object is not populated with the metadata properties it would contain when using an AutoDetectParser. Tests have shown that the BodyContentHandler object is not null.

Why is the Metadata object not being populated with anything (except the manually added RESOURCE_NAME_KEY)?

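import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public static Metadata getMetadata(File f) {
    Metadata metadata = new Metadata();
    ForkParser parser = new ForkParser();
    FileInputStream fis = null;
    try {
        fis = new FileInputStream(f);
        // -1 disables the write limit on the extracted body text
        BodyContentHandler contentHandler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();

        // Parsing runs in a forked JVM capped at 64 MB of heap
        parser.setJavaCommand("/usr/local/java6/bin/java -Xmx64m");
        metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());

        parser.parse(fis, contentHandler, metadata, context);

        String contentType = metadata.get(Metadata.CONTENT_TYPE);

        logger.error("contentHandler: " + contentHandler.toString());
        logger.error("metadata: " + metadata.toString());

        return metadata;

    } catch (Throwable e) {
        logger.error("Exception while analyzing file\n" +
        "CAUTION: metadata may still have useful content in it!\n" +
        "Exception: " + e, e);

        return metadata;
    } finally {
        // Release the stream and the forked process even if parsing failed
        if (fis != null) {
            try { fis.close(); } catch (IOException ignored) { }
        }
        parser.close();
    }
}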

Upvotes: 4

Views: 977

Answers (1)

Jukka Zitting

Reputation: 1092

Unfortunately, the ForkParser class in Tika 1.0 does not support metadata extraction. For now, the communication channel to the forked parser process only passes back SAX events, not metadata entries. I suggest you file a TIKA improvement issue to get this fixed.

One workaround you might want to consider is getting the extracted metadata from the <meta> tags in the <head> section of the XHTML document returned by the forked parser. Those should be available and contain most of the metadata entries normally returned in the Metadata object.
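A minimal sketch of that workaround, assuming the forked parser really does emit `<meta name="..." content="..."/>` elements in the XHTML `<head>`. `MetaTagCollector` and its `collect` helper are hypothetical names, not part of Tika; the class only uses the JDK's SAX API and stores the pairs in a plain `Map`, so repeated names keep only the last value.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects name/content pairs from <meta> SAX events. */
class MetaTagCollector extends DefaultHandler {
    private final Map<String, String> entries = new LinkedHashMap<String, String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the event source is not namespace-aware,
        // so fall back to the qualified name in that case.
        String element = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("meta".equals(element)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            // Skip <meta> elements that lack either attribute (e.g. http-equiv).
            if (name != null && content != null) {
                entries.put(name, content);
            }
        }
    }

    public Map<String, String> getEntries() {
        return entries;
    }

    /** Demo only: feeds a literal XHTML string through the JDK SAX parser. */
    static Map<String, String> collect(String xhtml) throws Exception {
        MetaTagCollector collector = new MetaTagCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getEntries();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><head>"
                + "<meta name=\"Content-Type\" content=\"application/pdf\"/>"
                + "</head><body/></html>";
        System.out.println(collect(sample));
    }
}
```

To use this with the code in the question, one option is to pass both handlers to the parser through Tika's `TeeContentHandler`, for example `new TeeContentHandler(contentHandler, collector)`, and then copy the collected entries into the `Metadata` object with `metadata.set(name, value)`. Whether every metadata entry shows up as a `<meta>` tag depends on the Tika version, so treat the result as best effort.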

Upvotes: 3
