Reputation: 12175
I'm trying to parse mbox-format email messages. However, Tika keeps trying to use the TNEFParser on these messages, which results in this error:
2012-08-21 17:44:42,139 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.poi.hmef.attribute.TNEFAttribute.<init>(TNEFAttribute.java:50)
at org.apache.poi.hmef.attribute.TNEFAttribute.create(TNEFAttribute.java:76)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:74)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.process(HMEFMessage.java:98)
at org.apache.poi.hmef.HMEFMessage.<init>(HMEFMessage.java:63)
at org.apache.tika.parser.microsoft.TNEFParser.parse(TNEFParser.java:80)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:102)
at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)
at org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:76)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.lab41.asf.etl.mapred.MailboxToTextMapper.parse(MailboxToTextMapper.java:124)
at org.lab41.asf.etl.mapred.MailboxToTextMapper.map(MailboxToTextMapper.java:88)
at org.lab41.asf.etl.mapred.MailboxToTextMapper.map(MailboxToTextMapper.java:45)
at org.apache.avro.mapred.HadoopMapper.map(HadoopMapper.java:81)
at org.apache.avro.mapred.HadoopMapper.map(HadoopMapper.java:34)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Is it possible to prevent Tika from using the TNEFParser? Any suggestions would be helpful.
Upvotes: 2
Views: 2580
Reputation: 2479
Here is the configuration version as suggested by @Gagravarr.
Firstly, create a tika-config.xml file:
<properties>
  <parsers>
    <!-- use the default parser in most cases, it is a composite of all
         the parsers listed in META-INF/services/org.apache.tika.parser.Parser -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- Disable tnef extraction -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/vnd.ms-tnef</mime>
      <mime>application/x-tnef</mime>
    </parser>
  </parsers>
</properties>
Now, create a TikaConfig from this configuration (assuming it is somewhere on your classpath):
ClassLoader loader = Thread.currentThread().getContextClassLoader();
TikaConfig config = new TikaConfig(loader.getResource("tika-config.xml"), loader);
When you create a new Parser, or use the Tika facade, pass in your configuration:
AutoDetectParser parser = new AutoDetectParser(config);
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(input, handler, metadata, context);
Any documents identified as TNEF will use the EmptyParser, which returns no content and doesn't actually parse anything.
This is effectively a blacklist. If you wanted a whitelist instead, you would need to remove the DefaultParser from the XML and manually configure each parser and its metadata.
Upvotes: 5
Reputation: 28885
Here is the programmatic version, as suggested by @Gagravarr. It replaces the parsers registered for the unwanted media types with an EmptyParser.
// Builds a Tika facade whose AutoDetectParser skips the given parsers.
private Tika createTika(final Parser... unnecessaryParsers)
        throws TikaException, IOException {
    final TikaConfig config = new TikaConfig();
    final AutoDetectParser autoDetectParser = new AutoDetectParser(config);
    final Set<MediaType> unnecessaryMimeTypes =
            getUnnecessaryMediaTypes(unnecessaryParsers);
    disableParsing(autoDetectParser, unnecessaryMimeTypes);
    final Detector detector = config.getDetector();
    final Tika tika = new Tika(detector, autoDetectParser);
    return tika;
}

// Collects every media type handled by the parsers we want to disable.
private Set<MediaType> getUnnecessaryMediaTypes(
        final Parser... unnecessaryParsers) {
    final Set<MediaType> unnecessaryTypes = new HashSet<MediaType>();
    for (final Parser unnecessaryParser: unnecessaryParsers) {
        // null ParseContext: assumes the parser does not need one to list its types
        final Set<MediaType> supportedTypes =
                unnecessaryParser.getSupportedTypes(null);
        unnecessaryTypes.addAll(supportedTypes);
    }
    return unnecessaryTypes;
}

// Re-maps each unwanted media type to an EmptyParser, which parses nothing.
private void disableParsing(final CompositeParser mainParser,
        final Set<MediaType> unnecessaryMediaTypes) {
    final EmptyParser emptyParser = new EmptyParser();
    final Map<MediaType, Parser> parsers = mainParser.getParsers();
    for (final MediaType unnecessaryType: unnecessaryMediaTypes) {
        parsers.put(unnecessaryType, emptyParser);
    }
    mainParser.setParsers(parsers);
}
Usage:
final Parser unnecessaryParser = new MP4Parser();
final Tika tika = createTika(unnecessaryParser);
You can also use it to avoid TIKA-1040: Could not delete temporary file.
Upvotes: 3
Reputation: 48346
For a long term fix, you should report this as a bug in Apache Tika, attach a problematic file to the bug report, and work with the project to get the bug fixed.
Short term, unpack the Tika-Parsers jar file, edit the META-INF/services/org.apache.tika.parser.Parser file, and remove the TNEF parser from the list. That will stop it from being auto-loaded and used by AutoDetectParser.
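If you would rather script that edit than do it by hand, something along these lines might work. This is only a sketch in plain Java: it assumes the services file has already been read into a list of lines, and the names ServicesFileFilter and withoutProvider are made up for illustration. The TNEFParser class name is the one from the stack trace in the question. Unpacking and repacking the jar is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): filters the lines of a
// META-INF/services provider file.
final class ServicesFileFilter {

    // Returns the lines with every entry naming `className` dropped.
    // Text after '#' is a comment in this file format, so it is ignored
    // when comparing; comment-only and blank lines are kept unchanged.
    static List<String> withoutProvider(List<String> lines, String className) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String entry = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!entry.equals(className)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Two entries taken from the stack trace above; a real file lists many more.
        List<String> original = Arrays.asList(
                "org.apache.tika.parser.mail.RFC822Parser",
                "org.apache.tika.parser.microsoft.TNEFParser");
        for (String line : withoutProvider(
                original, "org.apache.tika.parser.microsoft.TNEFParser")) {
            System.out.println(line);
        }
    }
}
```

Matching on the exact class name rather than a substring avoids accidentally dropping other parsers that merely have "TNEF" somewhere in their package or name.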
If you can't change the Tika Parsers jar file, it's a little trickier. There are two options available. One is to create a TikaConfig instance yourself, rather than relying on the default one, and supply only a limited list of parsers to it. Depending on whether you want to whitelist or blacklist, that might be easy or more difficult.
Alternatively, you could use the fact that the last registered parser for a mimetype wins. Create your own jar with a services file and your own dummy parser. Have that parser declare that it handles the TNEF mimetype, but have it do nothing. Add the jar to your classpath, and your dummy parser will be used instead.
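To make the "last registered parser wins" idea concrete, here is a toy model in plain Java. It is not Tika's code and does not use Tika's classes; LastWinsRegistry and its methods are hypothetical names, and the parser names are plain strings. It only mirrors the behaviour described above, using the two TNEF mimetypes from the config answer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model (an assumption, not Tika's implementation): a registry in which
// a later registration for a MIME type replaces any earlier one.
final class LastWinsRegistry {
    private final Map<String, String> handlerByType = new LinkedHashMap<>();

    // Registers `handlerName` for each MIME type, overwriting earlier entries.
    void register(String handlerName, String... mimeTypes) {
        for (String type : mimeTypes) {
            handlerByType.put(type, handlerName);
        }
    }

    // Returns the handler currently registered for the type, or null if none.
    String handlerFor(String mimeType) {
        return handlerByType.get(mimeType);
    }

    public static void main(String[] args) {
        LastWinsRegistry registry = new LastWinsRegistry();
        // The stock parser is registered first...
        registry.register("TNEFParser",
                "application/vnd.ms-tnef", "application/x-tnef");
        // ...then the do-nothing parser from your own jar claims the same types.
        registry.register("DummyParser",
                "application/vnd.ms-tnef", "application/x-tnef");
        System.out.println(registry.handlerFor("application/vnd.ms-tnef"));
    }
}
```

In the real setup the "registration" is the services file in your jar being picked up from the classpath, so whether your dummy parser actually ends up last depends on load order; it is worth checking which parser is selected for a TNEF document before relying on it.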
Upvotes: 3