Reputation: 467
I currently have two separate regex patterns to find target word+next word and target word+previous word:
string text = "Here is a test MYWORD statement for MYWORD regex";
string pattern = "(\\bMYWORD\\s)(\\w+)"; //MYWORD statement; MYWORD regex
string pattern = "(\\w+)(\\s\\bMYWORD)"; //test MYWORD; for MYWORD
Does regex provide an elegant method to combine the two patterns above for use with a single call?
Thanks
EDIT: Many thanks to m.buettner and Qtax for the great explanations and examples - very useful!
I've tried some of the examples provided, and they match 'MYWORD' in the required context, but perhaps I wasn't clear enough: I am trying to return all the phrases shown in the comments above, i.e.:
Matches(pattern) should return all of the following strings:
'MYWORD statement'
'MYWORD regex'
'test MYWORD'
'for MYWORD'
Apologies if my original question didn't explain that well enough!
Upvotes: 0
Views: 3647
Reputation: 75252
Do the match inside a lookahead:
string pattern = @"\b(?=(\w+\s+MYWORD|MYWORD\s+\w+)\b)";
string[] result = Regex.Matches(text, pattern)
                       .Cast<Match>()
                       .Select(match => match.Groups[1].Value)
                       .ToArray();
This regex doesn't consume any characters when it matches, which makes overlapping matches possible. You don't have to worry about infinite loops because the regex engine automatically bumps ahead one position before it starts looking for the next match. And the capturing group still works like normal.
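To make the overlapping behaviour concrete, here is a minimal, self-contained sketch that runs the lookahead pattern against the sample text from the question. The class and method names (OverlapDemo, FindPhrases) are made up for illustration; only the pattern and the LINQ chain come from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlapDemo
{
    // Hypothetical helper: returns every "word MYWORD" and "MYWORD word" phrase.
    // Each match is zero-length; the phrase itself lives in capturing group 1.
    public static string[] FindPhrases(string text)
    {
        const string pattern = @"\b(?=(\w+\s+MYWORD|MYWORD\s+\w+)\b)";
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(match => match.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string text = "Here is a test MYWORD statement for MYWORD regex";
        // Prints, in order of position in the text:
        //   test MYWORD
        //   MYWORD statement
        //   for MYWORD
        //   MYWORD regex
        foreach (string phrase in FindPhrases(text))
            Console.WriteLine(phrase);
    }
}
```

Note that the results come back in the order the phrases start in the text, not grouped by "previous word" and "next word".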
If you need to handle matches at the beginning and end of the string like the other responders mentioned, this should do it:
string pattern = @"\b(?=((?:^|\w+\s+)MYWORD|MYWORD(?:\s+\w+|$))\b)";
UPDATE: A commenter has asked how to capture the preceding and following words without including the target word. The answer turns out to be simple but not obvious:
string pattern = @"\b(?=((\w+)\s+MYWORD|MYWORD\s+(\w+))\b)";
string[] result = Regex.Matches(text, pattern)
                       .Cast<Match>()
                       .Select(match => match.Groups[2].Value + match.Groups[3].Value)
                       .ToArray();
The simple part is adding capturing groups for the individual words. The non-obvious part is realizing that in .NET, if a capturing group doesn't participate in the match and you access its Value property, you get an empty string. We know only one of the two groups will participate in each match. We don't need to know which one it was; we just want its value. Concatenating the two string values gives us exactly what we want.
But it gets better:
string[] result = Regex.Matches(text, pattern)
                       .Cast<Match>()
                       .Select(match => match.Result("$2$3"))
                       .ToArray();
The Result() method doesn't get used much because the rest of .NET's Regex API is so well designed, but when it's useful, it's brilliant!
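As a rough check of the two variants above, the following sketch runs both against the question's sample text and also prints each group's Success flag, so you can see which group took part in each match. The names NeighborDemo, ViaGroups and ViaResult are invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NeighborDemo
{
    const string Pattern = @"\b(?=((\w+)\s+MYWORD|MYWORD\s+(\w+))\b)";

    // Concatenates groups 2 and 3; the group that did not participate contributes "".
    public static string[] ViaGroups(string text) =>
        Regex.Matches(text, Pattern)
             .Cast<Match>()
             .Select(match => match.Groups[2].Value + match.Groups[3].Value)
             .ToArray();

    // Same idea, expressed as a replacement-style template.
    public static string[] ViaResult(string text) =>
        Regex.Matches(text, Pattern)
             .Cast<Match>()
             .Select(match => match.Result("$2$3"))
             .ToArray();

    public static void Main()
    {
        string text = "Here is a test MYWORD statement for MYWORD regex";
        foreach (Match match in Regex.Matches(text, Pattern))
        {
            // Exactly one of the two Success flags is true for each match.
            Console.WriteLine(
                $"group 2: Success={match.Groups[2].Success}, Value=\"{match.Groups[2].Value}\"; " +
                $"group 3: Success={match.Groups[3].Success}, Value=\"{match.Groups[3].Value}\"");
        }
        Console.WriteLine(string.Join(", ", ViaGroups(text)));  // test, statement, for, regex
        Console.WriteLine(string.Join(", ", ViaResult(text)));  // test, statement, for, regex
    }
}
```

If you do need to know which side the word came from, checking Success on group 2 or group 3 tells you without any string comparison.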
Upvotes: 5
Reputation: 33928
For your example, something as simple as this would work:
(\w+)\sMYWORD\s(\w+)
But that requires that there are words on both sides of MYWORD.
If there may not be a word on one side, you could make them optional, like this:
(?:(\w+)\s)?\bMYWORD\b(?:\s(\w+))?
But that will also match a MYWORD with no words around it.
If you want to match a MYWORD with at least one word next to it, you could use:
(?:(\w+)\sMYWORD\b(?:\s(\w+))?|\bMYWORD\s(\w+))
Although here the word on the right would be in either group 2 or group 3.
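Since the right-hand word can land in either group, one way to read it is to check which group actually participated. This is only a sketch: SideWordsDemo and FindSides are hypothetical names, the sample sentence is invented, and the "left|right" output format is just for display.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SideWordsDemo
{
    const string Pattern = @"(?:(\w+)\sMYWORD\b(?:\s(\w+))?|\bMYWORD\s(\w+))";

    // Returns "left|right" for each match; a missing side is left empty.
    public static string[] FindSides(string text) =>
        Regex.Matches(text, Pattern)
             .Cast<Match>()
             .Select(match =>
             {
                 string left = match.Groups[1].Value;
                 // The right-hand word is in group 2 or group 3, depending on the branch taken.
                 string right = match.Groups[2].Success
                     ? match.Groups[2].Value
                     : match.Groups[3].Value;
                 return left + "|" + right;
             })
             .ToArray();

    public static void Main()
    {
        // Prints "|first" and then "last|".
        foreach (string sides in FindSides("MYWORD first and last MYWORD"))
            Console.WriteLine(sides);
    }
}
```

Keep in mind that this pattern consumes the surrounding words, so when two MYWORDs share a neighbouring word, that word is only reported for the first of them.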
Upvotes: 0
Reputation: 44289
First of all, some advice: use verbatim strings. They make escapes much nicer to deal with:
string pattern = @"(\bMYWORD\s)(\w+)"; //MYWORD statement; MYWORD regex
string pattern = @"(\w+)(\s\bMYWORD)"; //test MYWORD; for MYWORD
Note that your second pattern has the word boundary at the wrong end:
string pattern = @"(\w+)(\sMYWORD\b)"; //test MYWORD; for MYWORD
Now, the naive approach is simply this:
string pattern = @"(\w+)\s(MYWORD)\s(\w+)";
This has a few problems. First, it requires both words to be there, so if MYWORD appears at one end of the string, you won't get any match. This can be fixed by allowing for anchors instead of words:
string pattern = @"(?:(\w+)\s|^)(MYWORD)(?:\s(\w+)|$)";
Now there is one problem left: matches cannot overlap. If you have abc MYWORD def MYWORD ghi, the second MYWORD won't match. You can fix this by excluding the surrounding words from the match, using lookarounds:
string pattern = @"(?<=(\w+)\s|^)(MYWORD)(?=\s(\w+)|$)";
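The difference between the consuming pattern and the lookaround pattern can be seen in a small side-by-side sketch. The class and method names (LookaroundDemo, Neighbors) and the "previous|next" output format are assumptions made for this example; the two patterns are the ones given above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Consumes the neighbouring words, so adjacent matches cannot share a word.
    public const string Consuming = @"(?:(\w+)\s|^)(MYWORD)(?:\s(\w+)|$)";

    // Only MYWORD itself is consumed; the neighbours are captured inside lookarounds.
    public const string Lookaround = @"(?<=(\w+)\s|^)(MYWORD)(?=\s(\w+)|$)";

    // Formats each match as "previous|next", using groups 1 and 3.
    public static string[] Neighbors(string text, string pattern) =>
        Regex.Matches(text, pattern)
             .Cast<Match>()
             .Select(match => match.Groups[1].Value + "|" + match.Groups[3].Value)
             .ToArray();

    public static void Main()
    {
        string text = "abc MYWORD def MYWORD ghi";
        // One match only: abc|def
        Console.WriteLine(string.Join(", ", Neighbors(text, Consuming)));
        // Two matches: abc|def, def|ghi
        Console.WriteLine(string.Join(", ", Neighbors(text, Lookaround)));
    }
}
```

With the lookaround version, the value of each match is just MYWORD, and the shared word def shows up as the next word of the first match and the previous word of the second.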
If you want to allow matches that are neither at an end of the string nor next to an adjacent word (as in foo. MYWORD bar, where the . "blocks off" the preceding word), simply make the lookarounds optional. If they can match, their captures will be included, and if not, they won't cause the pattern to fail:
string pattern = @"(?<=(\w+)\s)?(MYWORD)(?=\s(\w+))?";
Upvotes: 2