akgnitd

Reputation: 1

Regex Matching in Java Performance Improvement

I am trying to match large regexes against multiline texts. For some of the regexes, execution takes around 3-4 minutes, which is causing performance issues. Code snippet:

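import java.util.regex.Pattern;
import java.util.stream.Collectors;

boolean matchedRegex = false;

for (Rules rule : rules) {
    // Join all regexes of this rule into a single alternation.
    String mergedRegex = rule.getRegexes().stream().collect(Collectors.joining("|"));
    // DOTALL lets '.' match line terminators; MULTILINE only changes '^' and '$'.
    final Pattern pattern = Pattern.compile(mergedRegex, Pattern.MULTILINE | Pattern.DOTALL);
    System.out.printf("Pattern: %s%n", pattern);
    // Stop at the first rule whose merged regex matches anywhere in the text.
    if (pattern.matcher(text).find()) {
        matchedRegex = true;
        break;
    }
}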
mergedRegex = "(?=.*MORTGAGE\b)(?=.* This Security Instrument is given to\b).*|(?=.*MORTGAGE\b)(?=.*Words used in multiple sections|WORDS USED OFTEN IN THIS DOCUMENT|The date of this Mortgage\b)(?=.*Security Instrument).*|(?=.*\bTHIS MORTGAGE made\b)(?=.*\bWITNESSETH\b).*|(?=.*\bMORTGAGE\b)(?=.*\bTHIS INDENTURE\b)(?=.*made the).*|(?=.*\bThis bond and mortgage\b)(?=.*\bmade the day of\b)(?=.*\bWitnesseth\b).*|(?=.*\bTHIS MORTGAGE\b)(?=.*\bis made this|is given on|is given to|by and between|is made on|entered into this\b).*|(?=.*\bCREDIT MORTGAGE\b)(?=.*Space Above This Line For Recording Data).*|(?=.*\bDOWN PAYMENT ASSISTANCE MORTGAGE\b)(?=.*THIS MORTGAGE).*|(?=.*\bSECURITY DEED\b)(?=.*\bWords used in multiple sections\b)(?=.*Security Instrument).*|(?=.*DOWN PAYMENT ASSISTANCE MORTGAGE\b)(?=.*\bmade and entered\b).*";

What I have done so far for better performance is to merge the regexes in rule.getRegexes() into one consolidated regex, and then execute that merged regex once for each rule.

Upvotes: 0

Views: 98

Answers (1)

Nikolas

Reputation: 44368

I suppose this is an unstructured document. I see no way to optimize the regex itself, so I would change the approach to the document instead.

It depends on how regular, predictable, and structured each document is. There are a few ways to go:

  • Stick with the current solution but change the approach a bit. Split the document into chunks if the structure allows it, and perform multiple targeted searches in the smaller partitions rather than one search over the whole document. A further advantage is that you may know what to expect in each partition, so the regex for each one becomes smaller and faster.
  • Index the document and look into tools specialized for text mining. If the document is generated from structured data such as XML, work with that data instead.
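
The first option could look roughly like the sketch below. It assumes that blank lines separate the logical parts of the document, which may not hold for your texts, so the separator would need adjusting. ChunkedMatcher and its method names are hypothetical and not part of your code. Each phrase gets its own small pattern instead of a chain of lookaheads, and a chunk counts as a hit only if every pattern is found in it. Note that this is stricter than the original regex, which allows the phrases to be anywhere in the whole text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; class and method names are illustrative only.
final class ChunkedMatcher {

    // Assumption: one or more blank (or whitespace-only) lines separate chunks.
    private static final Pattern CHUNK_SEPARATOR =
            Pattern.compile("(?:\\r?\\n[ \\t]*){2,}");

    // Splits the text at blank lines and drops empty chunks.
    static List<String> splitIntoChunks(String text) {
        List<String> chunks = new ArrayList<>();
        for (String chunk : CHUNK_SEPARATOR.split(text)) {
            if (!chunk.trim().isEmpty()) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    // Returns true if at least one chunk contains a match for every pattern.
    static boolean anyChunkMatchesAll(String text, List<Pattern> patterns) {
        for (String chunk : splitIntoChunks(text)) {
            boolean allFound = true;
            for (Pattern pattern : patterns) {
                if (!pattern.matcher(chunk).find()) {
                    allFound = false;
                    break; // this chunk cannot match, try the next one
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }
}
```

For a rule such as (?=.*\bTHIS MORTGAGE made\b)(?=.*\bWITNESSETH\b).* the pattern list would hold the two phrases as separate compiled patterns, compiled once and reused for every document.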

Upvotes: 1
