Reputation: 497
I have to write a regular expression to get every sequence of three words from a text. Words are separated by a single space. The code I wrote does not return all of the sequences. For example, for the text "one two three four five six" I get only two sequences: 1. one two three 2. four five six. But I want my regular expression to return all sequences, so the output would be: 1. one two three 2. two three four 3. three four five 4. four five six. Can somebody please tell me what's wrong with my regular expression? Here is my code:
string input = "one two three four five six";
string pattern = @"([a-zA-Z]+ ){2}[a-zA-Z]+";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
if (matches.Count > 0)
{
    Console.WriteLine("{0} ({1} matches):", input, matches.Count);
    Console.WriteLine();
    foreach (Match match in matches)
        Console.WriteLine(match.Value);
}
Console.ReadLine();
Upvotes: 1
Views: 275
Reputation: 50114
There's nothing wrong with your regular expression - it's just how regular expressions work. When you find a match, the search for the next match continues at the end of the one you just found - the width of the match is consumed.
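To make the "consumed" behaviour visible, here is a minimal sketch that prints where each match of the original pattern starts and how long it is. The class and method names (ConsumedMatchDemo, Describe) are made up for illustration; only Regex, Match.Index, Match.Length and Match.Value come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConsumedMatchDemo
{
    // Hypothetical helper: returns "index:length:value" for each match
    // of the original (consuming) pattern.
    public static string[] Describe(string input)
    {
        var rgx = new Regex(@"([a-zA-Z]+ ){2}[a-zA-Z]+");
        MatchCollection matches = rgx.Matches(input);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            result[i] = $"{m.Index}:{m.Length}:{m.Value}";
        }
        return result;
    }

    public static void Main()
    {
        // The second match can only begin after the first one ends.
        foreach (string line in Describe("one two three four five six"))
            Console.WriteLine(line);
    }
}
```

The first match covers indexes 0 to 12, so the search resumes at index 13 and "two three four" is never considered.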
So, how to fix this? One way is to make your match not consume anything. You can do this by placing your original pattern in a zero-width positive lookahead assertion:
string pattern = @"(?=([a-zA-Z]+ ){2}[a-zA-Z]+)";
added --->         ***                        *
(?=pattern) says "only match at this point if it's immediately followed by something matching pattern" - but the content matching pattern isn't part of the overall match, so it isn't consumed.
If it's not part of the match, though, it doesn't appear in match.Value - so how do you get the value out? Simple - just add a capturing group around the original pattern (i.e. (?=(pattern))), and the captured group will appear in your results as normal.
string pattern = @"(?=(([a-zA-Z]+ ){2}[a-zA-Z]+))";
added --->            *                        *
So now, you can go through your foreach loop as before, but match.Value will be empty - your desired result is in match.Groups[1].Value.
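As a rough sketch of that step, the loop below reads Groups[1] from each zero-width match. The names LookaheadCaptureDemo and Capture are hypothetical and exist only for this example; the pattern is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadCaptureDemo
{
    // Hypothetical helper: collects the text captured by group 1 of the
    // lookahead pattern. The match itself is zero-width.
    public static List<string> Capture(string input)
    {
        var rgx = new Regex(@"(?=(([a-zA-Z]+ ){2}[a-zA-Z]+))");
        var captured = new List<string>();
        foreach (Match match in rgx.Matches(input))
        {
            // match.Value is empty here; the three words are in group 1.
            captured.Add(match.Groups[1].Value);
        }
        return captured;
    }

    public static void Main()
    {
        foreach (string value in Capture("one two three four five six"))
            Console.WriteLine(value);
    }
}
```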
But now you have another problem. Your results are
one two three
ne two three
e two three
two three four
wo three four
and so on. This is because your pattern matches even when you start halfway through a word.
How to fix this?
We add another zero-width assertion, this time a negative lookbehind: (?<![a-zA-Z]). Rather than saying "only match if this point is followed by the pattern", it says "never match if this point is preceded by the pattern". Thus we'll never match at a point preceded by a letter. ne two three isn't returned, for example, as it's preceded by o.
string pattern = @"(?<![a-zA-Z])(?=(([a-zA-Z]+ ){2}[a-zA-Z]+))";
added --->         *************
With this pattern, you finally get your expected results.
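Putting it together, a complete version of the question's program might look like the sketch below. WordTriples and Find are illustrative names, not part of any library, and it assumes, as in the question, that words consist only of ASCII letters separated by single spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordTriples
{
    // Hypothetical helper: returns every run of three consecutive words,
    // including overlapping ones, using the final pattern from above.
    public static List<string> Find(string input)
    {
        // Lookbehind: do not start in the middle of a word.
        // Lookahead: capture three words without consuming them.
        var rgx = new Regex(@"(?<![a-zA-Z])(?=(([a-zA-Z]+ ){2}[a-zA-Z]+))");
        var triples = new List<string>();
        foreach (Match match in rgx.Matches(input))
            triples.Add(match.Groups[1].Value);
        return triples;
    }

    public static void Main()
    {
        string input = "one two three four five six";
        List<string> triples = Find(input);
        Console.WriteLine("{0} ({1} matches):", input, triples.Count);
        Console.WriteLine();
        foreach (string triple in triples)
            Console.WriteLine(triple);
    }
}
```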
Upvotes: 5