DSlomer64

Reputation: 4283

Is there a way to have Tika stop parsing a file once a match is found?

I have a Java 8 program that walks the directory tree from a user-supplied node, searching for files that match a list of user-supplied filename patterns.

The list of matched files can be filtered by an optional user-supplied search string. The code checks for this string only after Tika has parsed the entire file and returned all of its text, which is very slow when the tree walk hits huge files.

It's wasteful even for smaller files: once the search string has been found, parsing the rest of the file is wasted time.

Is there a way to have Tika stop parsing a file once a match is found?

EDIT

The code that the program is based on:

package org.apache.tika.example;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParsingExample {

  public static boolean contains(File file, String s) throws MalformedURLException, 
                     IOException, MimeTypeException, SAXException, TikaException
  {
    AutoDetectParser    parser    = new AutoDetectParser();
    BodyContentHandler  handler   = new BodyContentHandler(-1); // -1 disables the write limit
    Metadata            metadata  = new Metadata();
    try (InputStream stream = new FileInputStream(file)) // stream is closed even if parsing fails
    {
      // the whole file is parsed before the search on the next line can run
      parser.parse(stream, handler, metadata);
      return handler.toString().toLowerCase().contains(s.toLowerCase());
    }
    catch (IOException | SAXException | TikaException e)
    {
      System.out.println(file + ": " + e + "\n");
      return false;
    }
  }
  public static void main(String[] args)
  {
      try 
      {
        System.out.println("File " + filename + " contains <" + searchString + "> : " + contains(new File(filename), searchString));
      } 
      catch (IOException | SAXException | TikaException ex) 
      {
        System.out.println("Error: " + ex);
      }
  }   

  static String parseExample = ":(";
  static String searchString = "test";
  static String filename = "test.doc";
}

parser.parse() hands all of the file's text to the BodyContentHandler in a single call. There's no loop exposed to the caller that I could break out of, at least none that I'm aware of; hence the question.

EDIT 2

What I really want to know, I guess, is whether there's a Tika method that reads only n characters from a file instead of all of them. Then I could maybe put a loop around it and exit as soon as the search string is found.

Upvotes: 0

Views: 1268

Answers (1)

Konstantin Gribov

Reputation: 670

You can move the query matching into your own ContentHandler implementation (you can use DefaultHandler as a base). Reassemble the text from the parts passed to ContentHandler#characters(char[],int,int), and abort parsing by throwing an exception from that method once a match is found.

It's definitely not a pretty solution but it should stop parsing.

UPD code sample:

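import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class InterruptableParsingExample {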
    private Tika tika = new Tika(); // for default autodetect parser

    public boolean findInFile(String query, File file) {
        Metadata metadata = new Metadata();
        InterruptingContentHandler handler = new InterruptingContentHandler(query);
        ParseContext context = new ParseContext();
        context.set(Parser.class, tika.getParser());

        try (InputStream is = new BufferedInputStream(new FileInputStream(file))) {
            tika.getParser().parse(is, handler, metadata, context);
        } catch (QueryMatchedException e) {
            return true;
        } catch (SAXException | TikaException | IOException e) {
            // something went wrong with parsing...
            e.printStackTrace();
        }
        return false;
    }
}

class QueryMatchedException extends SAXException {}

class InterruptingContentHandler extends DefaultHandler {
    private String query;
    private StringBuilder sb = new StringBuilder();

    InterruptingContentHandler(String query) {
        // text is lower-cased in characters(), so the query must be lower-cased too
        this.query = query.toLowerCase();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        sb.append(new String(ch, start, length).toLowerCase());

        if (sb.toString().contains(query))
            throw new QueryMatchedException(); // interrupt parsing by throwing SAXException

        if (sb.length() > 2 * query.length())
            sb.delete(0, sb.length() - query.length()); // keep tail with query.length() chars
    }
}
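
If you want to try the abort-by-exception idea without Tika on the classpath, here is a minimal sketch that runs the same kind of handler against the JDK's built-in SAX parser. All class and method names in it (EarlyExitSaxDemo, MatchFound, MatchingHandler, parseUntilMatch, sampleXml) are made up for illustration, and the charsSeen counter is only there to show how much text was delivered before the parse stopped. The sample input splits the match across two characters() calls with inline markup, which is the case the sliding window exists for.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// All class and method names here are hypothetical; only the JDK SAX API is real.
class EarlyExitSaxDemo {

    // Signals "match found"; extends SAXException so characters() may throw it.
    static class MatchFound extends SAXException {
        MatchFound() { super("query matched"); }
    }

    static class MatchingHandler extends DefaultHandler {
        private final String query;
        private final StringBuilder window = new StringBuilder();
        int charsSeen = 0; // how much text the parser delivered before we stopped

        MatchingHandler(String query) {
            this.query = query.toLowerCase();
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            charsSeen += length;
            window.append(new String(ch, start, length).toLowerCase());
            if (window.indexOf(query) >= 0) {
                throw new MatchFound(); // aborts the parse
            }
            // Keep only the tail that could still be the start of a match.
            int keep = Math.max(query.length() - 1, 0);
            if (window.length() > keep) {
                window.delete(0, window.length() - keep);
            }
        }
    }

    // Returns true if the handler aborted the parse because the query matched.
    static boolean parseUntilMatch(String xml, MatchingHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return false; // parsed to the end without a match
        } catch (MatchFound e) {
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Builds a document whose match is split by inline markup, followed by a long tail.
    static String sampleXml(int tailLength) {
        StringBuilder xml = new StringBuilder("<doc>find the NEE<b>dle</b> here ");
        for (int i = 0; i < tailLength; i++) {
            xml.append('x');
        }
        return xml.append("</doc>").toString();
    }

    public static void main(String[] args) {
        MatchingHandler handler = new MatchingHandler("Needle");
        boolean found = parseUntilMatch(sampleXml(100_000), handler);
        System.out.println("found=" + found + ", charsSeen=" + handler.charsSeen);
    }
}
```

The window only needs to keep query.length() - 1 characters between calls, because a match that is not yet complete can overlap the previous chunks by at most that much.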

UPD2 Added to tika-example package: https://github.com/apache/tika/blob/trunk/tika-example/src/main/java/org/apache/tika/example/InterruptableParsingExample.java
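
As for the "read n characters in a loop" idea from the question's second edit: the Tika facade class also has parse(...) overloads that return a java.io.Reader over the extracted text, so a plain chunked read loop may be enough. The sketch below is not Tika code; ChunkedReaderSearch and containsIgnoreCase are hypothetical names, and it runs against a StringReader so it works without Tika. I'm assuming that a Reader obtained from something like new Tika().parse(file) could be passed in the same way, and that closing it early stops the underlying parse, but I haven't measured that.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; not part of Tika.
class ChunkedReaderSearch {

    // Reads chunkSize characters at a time and returns as soon as the query is seen.
    static boolean containsIgnoreCase(Reader reader, String query, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        String needle = query.toLowerCase();
        if (needle.isEmpty()) {
            return true;
        }
        char[] chunk = new char[chunkSize];
        StringBuilder window = new StringBuilder();
        try {
            int n;
            while ((n = reader.read(chunk)) != -1) {
                window.append(new String(chunk, 0, n).toLowerCase());
                if (window.indexOf(needle) >= 0) {
                    return true; // stop reading; the rest of the text is never requested
                }
                // Keep only the tail that could still be the start of a match.
                int keep = needle.length() - 1;
                if (window.length() > keep) {
                    window.delete(0, window.length() - keep);
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a Reader over extracted text.
        Reader text = new StringReader("some extracted text with a NEEDLE in it, then much more");
        System.out.println(containsIgnoreCase(text, "needle", 8));
    }
}
```

The caller should still close the Reader (for example with try-with-resources) once the search returns.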

Upvotes: 1
