Karthik Ramachandran

Reputation: 12175

Preventing Tika from using TNEFParser

I'm trying to parse mbox-format email messages. However, Tika keeps trying to use the TNEFParser on these messages, which results in an out-of-memory error:

2012-08-21 17:44:42,139 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
    at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
    at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
    at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:80)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
    at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.lab41.asf.etl.mapred.MailboxToTextMapper.parse(MailboxToTextMapper.java:124)
    at org.lab41.asf.etl.mapred.MailboxToTextMapper.map(MailboxToTextMapper.java:88)
    at org.lab41.asf.etl.mapred.MailboxToTextMapper.map(MailboxToTextMapper.java:45)
    at org.apache.avro.mapred.HadoopMapper.map(HadoopMapper.java:81)
    at org.apache.avro.mapred.HadoopMapper.map(HadoopMapper.java:34)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
    at org.apache.hadoop.mapred.Child.main(Child.java:260)

Is it possible to prevent Tika from using the TNEFParser? Any suggestions would be helpful.

Upvotes: 2

Views: 2580

Answers (3)

Derek Troy-West

Reputation: 2479

Here is the configuration version as suggested by @Gagravarr.

Firstly, create a tika-config.xml file:

<properties>
  <parsers>

    <!-- use the default parser in most cases, it is a composite of all 
         the parsers listed in META-INF/services/org.apache.tika.parser.Parser -->
    <parser class="org.apache.tika.parser.DefaultParser"/>

    <!-- Disable tnef extraction-->    
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/vnd.ms-tnef</mime>
      <mime>application/x-tnef</mime>
    </parser>

  </parsers>
</properties>

Now, create a TikaConfig from this configuration (assuming it is somewhere on your classpath):

ClassLoader loader = Thread.currentThread().getContextClassLoader();
TikaConfig config = new TikaConfig(loader.getResource("tika-config.xml"), loader);

When you create a new Parser, or use the Tika facade, pass in your configuration:

AutoDetectParser parser = new AutoDetectParser(config);
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(input, handler, metadata, context);

Any documents identified as TNEF will use the EmptyParser, which returns no content and doesn't actually parse anything.

This is effectively a blacklist. If you wanted a whitelist, you would need to remove the DefaultParser from the XML and manually configure each parser and its MIME types.

Upvotes: 5

palacsint

Reputation: 28885

Here is the programmatic version, as suggested by @Gagravarr. It takes the media types supported by the parsers you don't want and re-maps each of them to an EmptyParser.

private Tika createTika(final Parser... unnecessaryParsers) 
        throws TikaException, IOException {
    final TikaConfig config = new TikaConfig();
    final AutoDetectParser autoDetectParser = new AutoDetectParser(config);

    final Set<MediaType> unnecessaryMimeTypes = 
        getUnnecessaryMediaTypes(unnecessaryParsers);
    disableParsing(autoDetectParser, unnecessaryMimeTypes);

    final Detector detector = config.getDetector();
    final Tika tika = new Tika(detector, autoDetectParser);
    return tika;
}

private Set<MediaType> getUnnecessaryMediaTypes(
        final Parser... unnecessaryParsers) {
    final Set<MediaType> unnecessaryTypes = new HashSet<MediaType>();
    for (final Parser unnecessaryParser: unnecessaryParsers) {
        final Set<MediaType> supportedTypes = 
            unnecessaryParser.getSupportedTypes(null);
        unnecessaryTypes.addAll(supportedTypes);
    }
    return unnecessaryTypes;
}

private void disableParsing(final CompositeParser mainParser, 
        final Set<MediaType> unnecessaryMediaTypes) {
    final EmptyParser emptyParser = new EmptyParser();

    final Map<MediaType, Parser> parsers = mainParser.getParsers();
    for (final MediaType unnecessaryType: unnecessaryMediaTypes) {
        parsers.put(unnecessaryType, emptyParser);
    }

    mainParser.setParsers(parsers);
}

Usage:

final Parser unnecessaryParser = new MP4Parser();
final Tika tika = createTika(unnecessaryParser);

You can also use it to avoid TIKA-1040: Could not delete temporary file.

Upvotes: 3

Gagravarr

Reputation: 48346

For a long-term fix, you should report this as a bug in Apache Tika, attach a problematic file to the bug report, and work with the project to get the bug fixed.

Short term, unpack the Tika-Parsers jar file, edit the META-INF/services/org.apache.tika.parser.Parser file, and remove the TNEF parser from the list. That will stop it being auto-loaded and used by AutoDetectParser.

Without changes to the Tika Parsers jar file, it's a little trickier. There are two options available. One is to create a TikaConfig instance yourself, rather than relying on the default one, and supply only a limited list of parsers to it. Depending on whether you want to whitelist or blacklist, that might be easy or more difficult. Alternatively, you could use the fact that the last registered parser for a mimetype wins. So, create your own jar with a services file and your own dummy parser. Have that parser declare that it handles the TNEF mimetype, but have it do nothing. Add the jar to your classpath, and your dummy parser will be used instead.
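
To see why the dummy-parser trick works, here is a toy model of a "last registered wins" lookup. It is not Tika's code: ToyParser, buildLookup, and the parser names are hypothetical stand-ins made up for illustration. It assumes only what is stated above, namely that a parser registered later for a mimetype replaces one registered earlier.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LastRegisteredWins {

    /** Hypothetical stand-in for a parser: a name plus the MIME types it claims. */
    static final class ToyParser {
        final String name;
        final Set<String> types;

        ToyParser(String name, Set<String> types) {
            this.name = name;
            this.types = types;
        }
    }

    /** Walks the parsers in registration order; a later put() overwrites an earlier one. */
    static Map<String, ToyParser> buildLookup(List<ToyParser> registered) {
        Map<String, ToyParser> byType = new HashMap<>();
        for (ToyParser parser : registered) {
            for (String type : parser.types) {
                byType.put(type, parser);
            }
        }
        return byType;
    }

    public static void main(String[] args) {
        String tnef = "application/vnd.ms-tnef";
        ToyParser real = new ToyParser("real-tnef", Collections.singleton(tnef));
        ToyParser dummy = new ToyParser("do-nothing", Collections.singleton(tnef));

        // The dummy is registered after the real parser, so it wins the lookup.
        Map<String, ToyParser> lookup = buildLookup(Arrays.asList(real, dummy));
        System.out.println(lookup.get(tnef).name); // prints "do-nothing"
    }
}
```

How the registration order is decided in a real deployment depends on how Tika loads its services files, which this sketch does not model. Check which parser is actually chosen for a TNEF attachment before relying on it.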

Upvotes: 3
