Reputation: 341
I am trying to use a multi-line regex to match all wildcards in a given source string. These strings can be in excess of 70,000 lines and each item is separated by a new line.
I seem to be experiencing huge processing times for my current regex and I can only assume that this is because it is probably poorly constructed and inefficient. If I execute the code on my phone it seems to run for an eternity.
My current regex:
(?im)(?=^(?:\*|.+\*$))^(?:\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+(?:[a-z0-9]+)(?:[.-]?\*)?$
Valid wildcard examples:
*test.com
*.test.com
*test
test.*
test*
*test*
I compile the pattern with:
private static final String WILDCARD_PATTERN = "(?im)(?=^(?:\\*|.+\\*$))^(?:\\*[.-]?)?(?:(?!-)[a-z0-9-]+(?:(?<!-)\\.)?)+(?:[a-z0-9]+)(?:[.-]?\\*)?$";
private static final Pattern wildcard_r = Pattern.compile(WILDCARD_PATTERN);
I look for matches with:
// Wildcards
while (wildcardPatternMatch.find()) {
String wildcard = wildcardPatternMatch.group();
myProperty.add(new property(wildcard, providerId));
System.out.println(wildcard);
}
Are there any changes I can make to improve / optimise the regex or do I need to look at running .replaceAll
several times to remove all of the clutter before passing for regex matching?
Upvotes: 1
Views: 696
Reputation: 626774
The pattern you need is
(?im)^(?=\*|.+\*$)(?:\*[.-]?)?[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*(?:[.-]?\*)?$
See the regex demo
Main points:
^
. If it is before, the check is done before and after each char in the string. Once it is after ^
, it is only performed once at the start of a line(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+
part, although short, is actually killing performance since the (?:(?<!-)\.)?
is optional pattern, and the whole pattern gets reduced to (a+)+
, a known type of pattern that causes catastrphic backtracking granted there are other subpatterns to the right of it. You need to unwrap it, the best "linear" way is [a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*
.The rest is OK.
Details
(?im)
- case insensitive and multiline modifiers^
- start of a line(?=\*|.+\*$)
- the string should either start or end with *
(?:\*[.-]?)?
- an optional substring matching a *
and an optional .
or -` char[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
- an alphanumeric char followed with an optional sequence of any 0+ alphanumeric chars or/and -
followed with an alphanumeric char(?:\.[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)*
- 0 or more sequences of a dot followed with the pattern described above(?:[.-]?\*)?
- an optional substring matching an optional .
or -char and then a
*`$
- end of a line.Upvotes: 2
Reputation: 94
I'd suggest taking a look at https://en.wikipedia.org/wiki/ReDoS#Evil_regexes
Your regex contains a repeated pattern:
(?:(?!-)[a-z0-9-]+(?:(?<!-)\.)?)+
Just as a quick example of how this might slow it down, take a look at the processing time on these two examples: exact matches versus having extra characters at the end and even worse, that set repeated several times
Edit: Another good reference
Upvotes: 1