Reputation: 3275
I want Tika to output only the names of the files within the archive (if the input file is an archive) and the file content as usual if the input file is not an archive. How can I do that?
Upvotes: 1
Views: 1462
Reputation: 3275
I extended the ParsingEmbeddedDocumentExtractor class and overrode parseEmbedded so that it writes only the name of each embedded file:
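import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class CustomParsingEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
    // XHTML namespace URI for the emitted <h1> elements (not visible from the parent class)
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    public CustomParsingEmbeddedDocumentExtractor(ParseContext context) {
        super(context);
    }

    @Override
    public void parseEmbedded(
            InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // Write the embedded entry's name as an <h1> heading
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name != null && name.length() > 0) {
            handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
            char[] chars = name.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "h1", "h1");
        }
        // The parsing logic is deliberately removed here: we just want the file names.
    }
}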
Then I registered it on the ParseContext before calling someparser.parse():
context.set(EmbeddedDocumentExtractor.class, new CustomParsingEmbeddedDocumentExtractor(this.context));
This works only for zip, tar and jar archives, which is enough for me.
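If the names are needed as a list rather than as headings in the output, one option is a small SAX handler that collects the text of each h1 element. The following is only a sketch: EmbeddedNameCollector and getNames() are hypothetical names, not part of Tika, and it assumes that the handler passed to parse() receives the h1 events written by the extractor above. Note that it would also pick up any genuine h1 headings in a non-archive document.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): collects the text of every <h1> element.
class EmbeddedNameCollector extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <h1>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("h1".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("h1".equals(localName) && current != null) {
            names.add(current.toString());
            current = null;
        }
    }

    public List<String> getNames() {
        return names;
    }
}
```

An instance of this class could presumably be passed as the ContentHandler argument of parse(), with getNames() read afterwards; I have not verified that against every Tika version.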
Upvotes: 4