Limit the number of Embedded files to be parsed in Tika

Question

While writing a custom EmbeddedDocumentExtractor class, I need to parse the embedded documents inside a file and perform some operation on a limited number of them (say 10).

If I work with a file that has 1000 embedded documents, every one of them is processed, which is a waste of time. Is there a way to parse only the first few embedded files?

public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
        throws SAXException, IOException {

    if (fileCount >= COUNT_LIMIT) {
        // skip file
    } else {
        // perform op
    }
}

With this approach, every embedded file still triggers a comparison between fileCount (the number of embedded files already processed) and COUNT_LIMIT, which takes time. I would rather bring the whole process to a halt once the limit is reached.

SylarBenes · Accepted Answer

UPDATED after question from OP:

I understand you're already writing a custom class that implements Tika's EmbeddedDocumentExtractor, since your question starts with:

"On creating a custom EmbeddedDocumentExtractor class, ..."

Looking at the Tika GitHub repository, I see that EmbeddedDocumentExtractor is an interface implemented by a class named ParsingEmbeddedDocumentExtractor, which has a concrete parseEmbedded method. I am going to assume this is the method you want to use, but with a limit of n.

I would suggest you make a custom class that extends ParsingEmbeddedDocumentExtractor (and so already implements EmbeddedDocumentExtractor). In this class you define a constant named COUNT_LIMIT. Then you override the parseEmbedded method to do the following:

Separate the files in your InputStream
Put those separated files in a for loop that uses the limit
Call the parent method on each of those files.

So it would look something like this:

// Extending ParsingEmbeddedDocumentExtractor already implements EmbeddedDocumentExtractor.
class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {

    private static final int COUNT_LIMIT = 10;

    ...

    @Override
    public void parseEmbedded(
            InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {

        // separate the files in the InputStream
        // (streamOfOneFile below is a placeholder for one of those separated files)

        for (int i = 0; i < COUNT_LIMIT; i++) {
            super.parseEmbedded(streamOfOneFile, handler, metadata, outputHtml);
        }
    }

}
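
A related way to apply the same limit, in case the extractor is invoked once per embedded document rather than once per container: Tika's EmbeddedDocumentExtractor interface also declares shouldParseEmbedded(Metadata), and a subclass could keep a counter there and return false once the limit is reached. The sketch below only illustrates that counting pattern. EmbeddedHandler, RecordingHandler, LimitedHandler and LimitedHandlerDemo are hypothetical stand-ins that use plain strings so the code runs without Tika on the classpath, and the loop in run() assumes the container parser asks shouldParseEmbedded before each parseEmbedded call, which is worth checking against the Tika version in use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two methods of Tika's EmbeddedDocumentExtractor,
// reduced to plain strings (not the real Tika signatures).
interface EmbeddedHandler {
    boolean shouldParseEmbedded(String name);
    void parseEmbedded(String name);
}

// Stand-in for the parent extractor: accepts everything and records what it parsed.
class RecordingHandler implements EmbeddedHandler {
    final List<String> parsed = new ArrayList<>();

    @Override
    public boolean shouldParseEmbedded(String name) {
        return true;
    }

    @Override
    public void parseEmbedded(String name) {
        parsed.add(name);
    }
}

// Accepts only the first `limit` embedded documents and declines the rest.
class LimitedHandler extends RecordingHandler {
    private final int limit;
    private int accepted = 0;

    LimitedHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public boolean shouldParseEmbedded(String name) {
        if (accepted >= limit) {
            return false; // limit reached: decline without parsing
        }
        accepted++;
        return super.shouldParseEmbedded(name);
    }
}

class LimitedHandlerDemo {
    // Mimics a container parser that asks before parsing each embedded item.
    static List<String> run(int total, int limit) {
        LimitedHandler handler = new LimitedHandler(limit);
        for (int i = 0; i < total; i++) {
            String name = "embedded-" + i;
            if (handler.shouldParseEmbedded(name)) {
                handler.parseEmbedded(name);
            }
        }
        return handler.parsed;
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 10).size()); // prints 10
    }
}
```

Note that this does not stop the container parser from walking over the remaining entries. It only avoids the parsing work for each of them, so the per-entry cost drops to one integer comparison.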
