Vigneshwaran

Reputation: 3275

Extract only the file names from an archive using Apache Tika

I want Tika to output only the names of the files within the archive (if the input file is an archive) and the file content as usual if the input file is not an archive. How can I do that?

Upvotes: 1

Views: 1462

Answers (1)

Vigneshwaran

Reputation: 3275

I extended the ParsingEmbeddedDocumentExtractor class so that it writes each embedded entry's name but skips parsing the entry itself:

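import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class CustomParsingEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
  // XHTML namespace URI used for the emitted elements.
  private static final String XHTML = "http://www.w3.org/1999/xhtml";

  public CustomParsingEmbeddedDocumentExtractor(ParseContext context) {
    super(context);
  }

  @Override
  public void parseEmbedded(
      InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
      throws SAXException, IOException {

    // Write the embedded entry's name as an <h1> element.
    String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
    if (name != null && name.length() > 0) {
      handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
      char[] chars = name.toCharArray();
      handler.characters(chars, 0, chars.length);
      handler.endElement(XHTML, "h1", "h1");
    }

    // The parsing logic is deliberately removed here: we only want the file names.
  }
}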

and registered it on the ParseContext before calling someparser.parse():

context.set(EmbeddedDocumentExtractor.class, new CustomParsingEmbeddedDocumentExtractor(this.context));

This works only for zip, tar and jar archives, which is enough for me.
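
On the consuming side, the names can be picked back out of the SAX event stream. The following is only a minimal sketch using the JDK's own SAX classes, not part of Tika: `EntryNameCollector`, `EntryNameCollectorDemo` and `emitName` are hypothetical names, and `emitName` merely simulates the three events the extractor above fires per entry so the example runs without Tika on the classpath. In real use you would presumably pass (or wrap) such a handler as the ContentHandler given to someparser.parse().

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of every XHTML h1 element. */
class EntryNameCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final List<String> names = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an h1

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (XHTML.equals(uri) && "h1".equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && XHTML.equals(uri) && "h1".equals(localName)) {
            names.add(current.toString());
            current = null;
        }
    }

    List<String> getNames() {
        return names;
    }
}

class EntryNameCollectorDemo {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Simulates the events the custom extractor emits for one archive entry.
    static void emitName(ContentHandler handler, String name) throws SAXException {
        handler.startElement(XHTML, "h1", "h1", new AttributesImpl());
        char[] chars = name.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "h1", "h1");
    }

    public static void main(String[] args) throws SAXException {
        EntryNameCollector collector = new EntryNameCollector();
        emitName(collector, "docs/readme.txt");
        emitName(collector, "lib/util.jar");
        System.out.println(collector.getNames()); // prints [docs/readme.txt, lib/util.jar]
    }
}
```

Note that this sketch treats every h1 as an entry name, so it assumes the top-level document itself produces no h1 elements of its own.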

Upvotes: 4
