Santhosh

Reputation: 451

Limit the number of Embedded files to be parsed in Tika

I am writing a custom EmbeddedDocumentExtractor class. I need to parse the documents embedded inside a file, but perform an operation on only a limited number of them (say 10).

If I work with a file that has 1000 embedded documents, every one of them is processed, which is a complete waste of time. Is there a way to parse only the first few embedded files?

public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml) throws SAXException, IOException {

    if (fileCount >= COUNT_LIMIT) {
        // skip file
    } else {
        // perform op
    }
}

With this approach, the method is still invoked for every embedded file and spends time comparing fileCount (the number of embedded files already processed) against COUNT_LIMIT, instead of bringing the whole process to a halt.

Upvotes: 0

Views: 616

Answers (2)

Tim Allison

Reputation: 635

Depending on your needs, perhaps try the RecursiveParserWrapper; you can set the maximum number of embedded resources in the RecursiveParserWrapperHandler. See for example: https://github.com/apache/tika/blob/2d73e91476325c235dc9a9be116e8d02c7658850/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L204

Upvotes: 2

SylarBenes

Reputation: 411

UPDATED after question from OP:

I understand you're already writing a custom class that implements Tika's EmbeddedDocumentExtractor, since your question starts with:

"On creating a custom EmbeddedDocumentExtractor class, "

Looking at the Tika GitHub repository, I see that EmbeddedDocumentExtractor is an interface implemented by a class named ParsingEmbeddedDocumentExtractor, which has a concrete parseEmbedded method. I am going to assume this is the method you want to use, but with a limit of n.

I would suggest a custom class that extends ParsingEmbeddedDocumentExtractor, which already implements EmbeddedDocumentExtractor. In this class, define a constant named COUNT_LIMIT, then override the parseEmbedded method to do the following:

  1. Separate the files in your InputStream
  2. Put those separated files in a for loop that uses the limit
  3. Call the parent method on each of those files.

So it would look something like this:

class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {

    // Maximum number of embedded files to parse
    private static final int COUNT_LIMIT = 10;

    ...

    @Override
    public void parseEmbedded(
            InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {

        // separate the files in the InputStream
        // (streamOfOneFile below is a placeholder for one of those separated files)

        for (int i = 0; i < COUNT_LIMIT; i++) {
            super.parseEmbedded(streamOfOneFile, handler, metadata, outputHtml);
        }
    }

}
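To make the limit itself concrete, here is a minimal sketch of just the counting part, written without any Tika dependency so it can be run as-is. The class name EmbeddedLimit and its methods are hypothetical and not part of Tika; the sketch assumes that a single extractor instance is offered the embedded documents one at a time.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a Tika class): admits only the first `limit` embedded documents. */
final class EmbeddedLimit {
    private final int limit;
    private final AtomicInteger offered = new AtomicInteger();

    EmbeddedLimit(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    /** Returns true for the first `limit` calls and false for every call after that. */
    boolean tryAcquire() {
        return offered.getAndIncrement() < limit;
    }

    /** Number of embedded documents offered so far, whether admitted or rejected. */
    int offered() {
        return offered.get();
    }
}
```

One possible way to wire this in: keep an EmbeddedLimit in the custom extractor and return tryAcquire() from shouldParseEmbedded(Metadata), which the EmbeddedDocumentExtractor interface declares alongside parseEmbedded. Parsers typically check that method before calling parseEmbedded, so rejected documents should never be handed to it. The integer comparison is negligible next to the cost of parsing; whether the container parser still walks the remaining entries depends on that parser.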

Upvotes: 2
