Reputation: 4283
I have a Java 8 program that walks the directory tree from a user-supplied node, searching for files that match a list of user-supplied filename patterns.
The list of matched files can be filtered with an optional user-supplied String to find. The code checks for this string only after Tika has parsed the whole file, by searching the complete extracted text. This is really bad when huge files are found along the tree walk.
But it's bad anyway: as soon as the string is found, any further parsing of that file is wasted time.
Is there a way to have Tika stop parsing a file once a match is found?
EDIT
The code that the program is based on:
package org.apache.tika.example;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParsingExample {
    static String searchString = "test";
    static String filename = "test.doc";

    public static boolean contains(File file, String s) throws MalformedURLException,
            IOException, MimeTypeException, SAXException, TikaException
    {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing fails;
        // a failure to open the file is now reported by the catch below
        try (InputStream stream = new FileInputStream(file))
        {
            // the whole file is parsed before the search below can run
            parser.parse(stream, handler, metadata);
            return handler.toString().toLowerCase().contains(s.toLowerCase());
        }
        catch (IOException | SAXException | TikaException e)
        {
            System.out.println(file + ": " + e + "\n");
            return false;
        }
    }

    public static void main(String[] args)
    {
        try
        {
            System.out.println("File " + filename + " contains <" + searchString + "> : " + contains(new File(filename), searchString));
        }
        catch (IOException | SAXException | TikaException ex)
        {
            System.out.println("Error: " + ex);
        }
    }
}
parser.parse hands all of the file's text to the BodyContentHandler handler in a single call. There's no loop available to the code that calls the parser. None that I'm aware of; hence the question.
EDIT 2
What I really want to know, I guess, is whether there's a Tika method that reads only n characters from a file instead of all of them. Then I could maybe stick a loop around it and exit as soon as the search string is found.
Upvotes: 0
Views: 1268
Reputation: 670
You can move the query matching into your own ContentHandler implementation (you can take DefaultHandler as a base): reassemble the text from the parts passed to ContentHandler#characters(char[],int,int), and abort parsing by throwing an exception there once a match is found.
It's definitely not a pretty solution, but it should stop parsing.
UPD code sample:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class InterruptableParsingExample {
    private final Tika tika = new Tika(); // for default autodetect parser

    public boolean findInFile(String query, File file) {
        Metadata metadata = new Metadata();
        InterruptingContentHandler handler = new InterruptingContentHandler(query);
        ParseContext context = new ParseContext();
        context.set(Parser.class, tika.getParser());
        try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
            tika.getParser().parse(is, handler, metadata, context);
        } catch (QueryMatchedException e) {
            return true; // the handler aborted parsing because it found the query
        } catch (SAXException | TikaException | IOException e) {
            // something went wrong with parsing...
            e.printStackTrace();
        }
        return false;
    }
}

class QueryMatchedException extends SAXException {}

class InterruptingContentHandler extends DefaultHandler {
    private final String query;
    private final StringBuilder sb = new StringBuilder();

    InterruptingContentHandler(String query) {
        // the text is lower-cased below, so the query must be lower-cased too
        this.query = query.toLowerCase();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        sb.append(new String(ch, start, length).toLowerCase());
        if (sb.indexOf(query) >= 0)
            throw new QueryMatchedException(); // interrupt parsing by throwing SAXException
        if (sb.length() > 2 * query.length())
            sb.delete(0, sb.length() - query.length()); // keep tail with query.length() chars
    }
}
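To try the throw-to-abort idea without Tika on the classpath, here is a minimal sketch that runs the same kind of handler against the JDK's built-in SAX parser. The class and method names (SaxEarlyExitDemo, StoppingHandler, elementsSeenBeforeMatch) are made up for this illustration and are not part of Tika or the JDK. It counts how many elements were opened before the handler threw, which shows that the rest of the document was never visited.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEarlyExitDemo {

    /** Hypothetical handler: aborts the parse by throwing once the query is seen. */
    static final class StoppingHandler extends DefaultHandler {
        private final String query;
        private final StringBuilder window = new StringBuilder();
        int elementsSeen = 0;     // how far the parser got
        boolean matched = false;  // set just before the aborting exception is thrown

        StoppingHandler(String query) {
            this.query = query.toLowerCase();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elementsSeen++;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            window.append(new String(ch, start, length).toLowerCase());
            if (window.indexOf(query) >= 0) {
                matched = true;
                throw new SAXException("match found, stopping parse");
            }
            // keep only a short tail so a match split across two calls is still found
            if (window.length() > 2 * query.length()) {
                window.delete(0, window.length() - query.length());
            }
        }
    }

    /** Returns the number of elements opened before the match, or -1 if there was no match. */
    static int elementsSeenBeforeMatch(String xml, String query) {
        StoppingHandler handler = new StoppingHandler(query);
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return -1; // parsed to the end without a match
        } catch (SAXException e) {
            if (handler.matched) {
                return handler.elementsSeen; // our own abort, not a real parse error
            }
            throw new IllegalStateException(e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Checking a flag on the handler instead of catching a dedicated exception type is just a design choice for this sketch: it still works if a parser wraps the exception before rethrowing it.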
UPD2 Added to tika-example package: https://github.com/apache/tika/blob/trunk/tika-example/src/main/java/org/apache/tika/example/InterruptableParsingExample.java
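As for the "read n characters and loop" idea from EDIT 2: the same keep-the-tail trick works over any java.io.Reader. As far as I know the Tika facade's parse(File) returns a Reader that extracts text as it is read, so a loop like the one below could be pointed at it, but treat that pairing as an assumption to verify. The sketch itself uses only the JDK, and ChunkedSearch and charsReadUntilMatch are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class ChunkedSearch {

    /**
     * Reads the reader in chunks of at most chunkSize chars and stops at the first
     * case-insensitive match. Returns how many chars had been read when the match
     * was found, or -1 if the reader was exhausted without a match.
     */
    static long charsReadUntilMatch(Reader reader, String query, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        String needle = query.toLowerCase();
        if (needle.isEmpty()) {
            return 0; // an empty query matches before anything is read
        }
        char[] buf = new char[chunkSize];
        StringBuilder window = new StringBuilder();
        long total = 0;
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                total += n;
                window.append(new String(buf, 0, n).toLowerCase());
                if (window.indexOf(needle) >= 0) {
                    return total; // stop here, the rest of the input is never read
                }
                // keep needle.length() - 1 chars: enough to catch a match
                // that straddles the boundary with the next chunk
                int keep = needle.length() - 1;
                if (window.length() > keep) {
                    window.delete(0, window.length() - keep);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }

    static boolean contains(Reader reader, String query, int chunkSize) {
        return charsReadUntilMatch(reader, query, chunkSize) >= 0;
    }
}
```

For example, charsReadUntilMatch(new StringReader("xxneedleyyyyyyyyyyyy"), "needle", 4) stops after two chunks even though the match spans both of them.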
Upvotes: 1