Reputation: 147
I'm trying to optimize one of my .NET app's regular expressions.
Regex: (?<!WordA\s(?:WordB\s)?)(WordB\s)?WordC
Logic:
Should Match:
Should Not Match:
The expression works, but as you can see, WordB appears twice in it, so I'm trying to remove one occurrence to get better performance.
Note: "Words" are in fact complex expressions.
Is there any way?
Upvotes: 2
Views: 129
Reputation: 626851
The problem with "optimizing" the (?<!WordA\s(?:WordB\s)?)(WordB\s)?WordC regex (which is a combination of (?<!WordA\s)WordC and (?<!WordA\s)WordB\sWordC) is that WordB and WordC are separated by whitespace, and a negative lookbehind does not make the regex engine skip the whole phrase when WordB WordC is preceded by WordA. The lookbehind only fails the attempt at the position where it was checked, so the engine moves on, and WordC will still match if you just use (?<!WordA\s)(WordB\s)?WordC. The lookbehind must restrict both WordB\sWordC and WordC, which is why you must repeat the optional WordB in the lookbehind pattern, the same way you would use it in the two "destructured" patterns shown above.
So, with a plain string regex, there is no other way.
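To make the difference concrete, here is a minimal sketch that runs both patterns against the input WordA WordB WordC. The literal words stand in for the real, more complex subpatterns, and the variable names (input, naive, full) are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

// Literal placeholders standing in for the real, more complex subpatterns.
var input = "WordA WordB WordC";

// Lookbehind that only checks for WordA directly before the match start.
var naive = @"(?<!WordA\s)(WordB\s)?WordC";
// Lookbehind that also allows for an optional WordB between WordA and the match start.
var full = @"(?<!WordA\s(?:WordB\s)?)(WordB\s)?WordC";

var naiveMatch = Regex.Match(input, naive);
var fullMatch = Regex.Match(input, full);

Console.WriteLine("naive: {0}", naiveMatch.Success ? naiveMatch.Value : "NO MATCH");
Console.WriteLine("full:  {0}", fullMatch.Success ? fullMatch.Value : "NO MATCH");
// => naive: WordC
// => full:  NO MATCH
```

With the naive pattern, the attempt starting at WordB is rejected by the lookbehind, but the next attempt starting at WordC is only preceded by WordB and a space, so it succeeds.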
A workaround involving some code change can look like
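using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Group 1 captures the optional WordA prefix; WordB appears only once.
var rx = @"(WordA\s)?(?:WordB\s)?WordC";
var strings = new List<string> {"WordC", "WordB WordC", "WordA WordC", "WordA WordB WordC"};
foreach (var s in strings)
{
    var m = Regex.Match(s, rx);
    // If Group 1 participated, WordA precedes the phrase, so the match is discarded.
    Console.WriteLine("{0}: {1}", s, (m.Groups[1].Success ? "NO MATCH" : m.Value));
}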
// => WordC: WordC
// => WordB WordC: WordB WordC
// => WordA WordC: NO MATCH
// => WordA WordB WordC: NO MATCH
See the C# demo.
In the (WordA\s)?(?:WordB\s)?WordC regex, (WordA\s)? captures WordA followed by a whitespace into Group 1, and if it matches, we know we need to discard the match. If the Group 1 .Success value is false, the match is valid.
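The same Group 1 check can be applied when scanning a longer text for all occurrences. The following is only a sketch under the assumption that the input contains several candidates in one string; the sample text and the variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Optional WordA is captured into Group 1; WordB appears only once in the pattern.
var rx = new Regex(@"(WordA\s)?(?:WordB\s)?WordC");

// Hypothetical sample text containing both valid and invalid occurrences.
var text = "WordC, WordA WordC, WordB WordC, WordA WordB WordC";

// Keep only the matches in which Group 1 (WordA) did not participate.
var valid = rx.Matches(text)
    .Cast<Match>()
    .Where(m => !m.Groups[1].Success)
    .Select(m => m.Value)
    .ToList();

Console.WriteLine(string.Join(" | ", valid));
// => WordC | WordB WordC
```

Because matches are found left to right and do not overlap, an occurrence preceded by WordA is consumed together with its WordA prefix and then filtered out, so its trailing WordC cannot be matched again on its own.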
Upvotes: 1