Reputation: 872
I have the following text:
"cat dog mouse lion"
And I search for "dog" or "mouse" using regex:
Regex regex = new Regex(@"dog|mouse");
The way Regex in C# behaves is that it first searches all the way through for the word dog. If it finds a match, it stops. How do I make it stop after finding the first occurrence of any of my words in the regex, meaning stop after "cat" as this occurs first?
Do I have to make multiple regex searches and match the indexes of the findings? Or is it possible to specify it in the regex expression?
Upvotes: 1
Views: 278
Reputation: 93056
No, you are wrong.
Regex regex = new Regex(@"dog|mouse");
and
Regex regex = new Regex(@"mouse|dog");
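both will find the word "dog", even though in the second case "mouse" comes first in the alternation.
The matching behaviour is different from what you described. The regex engine starts at the first character and checks whether the first alternative matches there; if it does not, the engine does not move on to the second character yet, it first tries the second alternative at that same position. Only when no alternative matches at a position does it advance to the next character, so the match that starts earliest in the text wins.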
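To check this behaviour, here is a minimal sketch written as C# 9+ top-level statements. The variable names (text, dogFirst, mouseFirst) are only illustrative and not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

string text = "cat dog mouse lion";

// Same alternatives, opposite order in the alternation.
Match dogFirst = Regex.Match(text, @"dog|mouse");
Match mouseFirst = Regex.Match(text, @"mouse|dog");

// Both calls report the leftmost match in the text: "dog" at index 4.
Console.WriteLine($"{dogFirst.Value} @ {dogFirst.Index}");     // dog @ 4
Console.WriteLine($"{mouseFirst.Value} @ {mouseFirst.Index}"); // dog @ 4
```

So no second search and no index comparison is needed: a single Regex.Match call already returns whichever word occurs first.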
But the ordering of the alternation is important in another respect. You will get problems when you have alternatives with the same beginning and you order them from short to long, e.g.
Regex regex = new Regex(@"Foo|Foobar");
This will never match the whole word "Foobar": even when "Foobar" is in the text, the first alternative "Foo" matches and the regex stops there.
To avoid those problems, order the alternatives from long to short:
Regex regex = new Regex(@"Foobar|Foo");
This will try "Foobar" first; on the text "Foo", when it finds that no "b" follows, it tries the second alternative and successfully matches "Foo".
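The effect of the ordering can be seen in a small sketch (C# 9+ top-level statements; the variable names are made up for the example):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Foobar";

// Short alternative first: "Foo" succeeds, so "Foobar" is never tried.
Match shortFirst = Regex.Match(input, @"Foo|Foobar");

// Long alternative first: the full word is matched.
Match longFirst = Regex.Match(input, @"Foobar|Foo");

// Long-first still falls back to "Foo" when no "bar" follows.
Match fallback = Regex.Match("Foo", @"Foobar|Foo");

Console.WriteLine(shortFirst.Value); // Foo
Console.WriteLine(longFirst.Value);  // Foobar
Console.WriteLine(fallback.Value);   // Foo
```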
Upvotes: 4
Reputation: 89639
One way to do that is to use a lazy quantifier with the dotall option (RegexOptions.Singleline in .NET):
Regex regex = new Regex(@"^.*?\b(?>dog|mouse)\b", RegexOptions.Singleline);
Another way is this:
Regex regex = new Regex(@"^(?>[^dm]+|(?>d+)(?!og\b)|(?>m+)(?!ouse\b))*\b(?>dog|mouse)\b");
It is longer but more efficient. The idea is to avoid the lazy quantifier, which is slow because it checks what follows after every single character. Here I describe the beginning as: everything that is not a "d" or an "m", OR some "d"s not followed by "og", OR some "m"s not followed by "ouse", zero or more times.
(?>...) is an atomic group: once it has matched, the regex engine cannot backtrack into it, so it is a kind of "all or nothing".
In flavors that support possessive quantifiers (PCRE, Java), [^dm]*+, d++ and m++ avoid backtracking in the same way and are more compact, but .NET does not support them and rejects such a pattern, so the atomic groups (?>d+) and (?>m+) are used here instead.
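A rough sketch of both patterns in use, as C# 9+ top-level statements. Two things here are assumptions added for illustration: the capturing group around (dog|mouse), so that the word itself can be read separately from the consumed prefix, and the use of atomic groups (?>d+) and (?>m+) in place of d++ and m++, since .NET does not accept possessive quantifiers.

```csharp
using System;
using System.Text.RegularExpressions;

string sample = "cat dog mouse lion";

// Lazy prefix: .*? grows one character at a time until a word follows.
Match lazy = Regex.Match(sample, @"^.*?\b(dog|mouse)\b", RegexOptions.Singleline);

// Unrolled prefix: skips text in chunks and never backtracks into them.
Match unrolled = Regex.Match(
    sample,
    @"^(?>[^dm]+|(?>d+)(?!og\b)|(?>m+)(?!ouse\b))*\b(dog|mouse)\b");

// The whole match runs from the start of the string through the word.
Console.WriteLine(lazy.Value);                                          // cat dog
// Group 1 holds only the word that was found first.
Console.WriteLine($"{lazy.Groups[1].Value} @ {lazy.Groups[1].Index}");         // dog @ 4
Console.WriteLine($"{unrolled.Groups[1].Value} @ {unrolled.Groups[1].Index}"); // dog @ 4
```

Because both patterns are anchored with ^, each returns at most one match per input string.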
Upvotes: 0