mmotti
mmotti

Reputation: 341

Java - Multiline regex bad performance with large string

I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.

I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.

My current regex:

(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$

Valid wildcard examples:

*test.com
*.test.com
*test
test.*
test*
*test*

I compile the pattern with:

private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);

I look for matches with:

// Wildcards
while (wildcardPatternMatch.find()) {
    String wildcard = wildcardPatternMatch.group();
    myProperty.add(new property(wildcard, providerId));
    System.out.println(wildcard);
}

Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll several times to remove all of the clutter before passing for regex matching?

Upvotes: 1

Views: 696

Answers (2)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626774

The pattern you need is

(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$

See the regex demo

Main points:

  • The first lookahead should be after ^. If it is before, the check is done before and after each char in the string. Once it is after ^, it is only performed once at the start of a line
  • The (?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ part, although short, is actually killing performance since the (?:(?<!-)\.)? is optional pattern, and the whole pattern gets reduced to (a+)+, a known type of pattern that causes catastrphic backtracking granted there are other subpatterns to the right of it. You need to unwrap it, the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*.

The rest is OK.

Details

  • (?im) - case insensitive and multiline modifiers
  • ^ - start of a line
  • (?=\*|.+\*$) - the string should either start or end with *
  • (?:\*[.-]?)? - an optional substring matching a * and an optional . or -` char
  • [a-z0-9](?:[a-z0-9-]*[a-z0-9])? - an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and - followed with an alphanumeric char
  • (?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)* - 0 or more sequences of a dot followed with the pattern described above
  • (?:[.-]?\*)? - an optional substring matching an optional . or -char and then a*`
  • $ - end of a line.

Upvotes: 2

Artichoke
Artichoke

Reputation: 94

I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes

Your regex contains a repeated pattern:

(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+ 

Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times

Edit: Another good reference

Upvotes: 1

Related Questions