Reputation: 451
I am writing a custom EmbeddedDocumentExtractor class. I need to parse the embedded documents inside a file and perform some operation on only a limited number of them (say 10).
If I work with a file that has 1000 embedded documents, every one of them is processed, which is a waste of time. Is there a way to parse only the first few embedded files?
public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml) throws SAXException, IOException {
    if (fileCount >= COUNT_LIMIT) {
        // skip file
    } else {
        // perform op
    }
}
With this approach, parseEmbedded is still called for every embedded file, and each call spends time comparing fileCount (the number of embedded files already processed) with COUNT_LIMIT instead of bringing the process to a halt.
Upvotes: 0
Views: 616
Reputation: 635
Depending on your needs, perhaps try the RecursiveParserWrapper; you can set the maximum number of embedded resources in the RecursiveParserWrapperHandler. See for example: https://github.com/apache/tika/blob/2d73e91476325c235dc9a9be116e8d02c7658850/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java#L204
Upvotes: 2
Reputation: 411
UPDATED after question from OP:
I understand you're already making a custom class that implements Tika's EmbeddedDocumentExtractor, as you start your question: "On creating a custom EmbeddedDocumentExtractor class, "
Looking at the Tika GitHub repository, I see that EmbeddedDocumentExtractor is an interface that is implemented by a class named ParsingEmbeddedDocumentExtractor, which has a concrete method parseEmbedded. I am going to assume this is the method you want to use, but with a limit of n.
I would suggest you make a custom class that extends ParsingEmbeddedDocumentExtractor (and therefore also implements EmbeddedDocumentExtractor). In this class you define a variable named COUNT_LIMIT. Then you override the parseEmbedded method so that it only delegates to the parent implementation while the limit has not been reached.
It would look something like this:
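import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
// Extending ParsingEmbeddedDocumentExtractor already implements EmbeddedDocumentExtractor.
class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
    private static final int COUNT_LIMIT = 10;
    private int fileCount = 0; // embedded documents parsed so far
    MyEmbeddedDocumentExtractor(ParseContext context) { super(context); }
    @Override
    public void parseEmbedded(
            InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // This method is called once per embedded document, so count calls instead of looping.
        if (fileCount < COUNT_LIMIT) {
            fileCount++;
            super.parseEmbedded(stream, handler, metadata, outputHtml);
        }
    }
}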
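The limit check can also live in the extractor's other method: Tika's EmbeddedDocumentExtractor interface declares shouldParseEmbedded(Metadata) alongside parseEmbedded, and container parsers typically ask it first and only hand over the embedded stream when it returns true. The sketch below models that idea without depending on Tika, so the names Extractor, LimitedExtractor and Demo are hypothetical stand-ins and the String parameter replaces the real Metadata, InputStream and ContentHandler arguments. It is meant only to illustrate the counting logic, not to be dropped into a Tika project as is.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Tika's EmbeddedDocumentExtractor (hypothetical;
// the real methods take Metadata, InputStream, ContentHandler and a boolean).
interface Extractor {
    boolean shouldParseEmbedded(String name);
    void parseEmbedded(String name);
}

// Accepts only the first `limit` embedded documents, then declines the rest.
class LimitedExtractor implements Extractor {
    private final int limit;
    private int accepted = 0;
    final List<String> parsed = new ArrayList<>();

    LimitedExtractor(int limit) {
        this.limit = limit;
    }

    @Override
    public boolean shouldParseEmbedded(String name) {
        // One integer comparison per embedded document; a "no" does no parsing work.
        return accepted < limit;
    }

    @Override
    public void parseEmbedded(String name) {
        accepted++;
        parsed.add(name); // stands in for the expensive per-document operation
    }
}

// Mimics how a container parser walks its embedded documents.
class Demo {
    static LimitedExtractor run(int total, int limit) {
        LimitedExtractor extractor = new LimitedExtractor(limit);
        for (int i = 0; i < total; i++) {
            String name = "embedded-" + i;
            if (extractor.shouldParseEmbedded(name)) {
                extractor.parseEmbedded(name);
            }
        }
        return extractor;
    }
}
```

Calling Demo.run(1000, 10) leaves exactly ten names in the parsed list. Note that in this model the container still enumerates all 1000 entries; only the per-document work is skipped, and the remaining cost is the integer comparison the question mentions.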
Upvotes: 2