Reputation: 55
I am trying to create a pattern that starts with a "WORD" and matches all letters, numbers and characters except the original "WORD" until it comes to the "ENDWORD".
In this example, I want it to match the second occurrence of "WORD" and match up until the "ENDWORD"; however, it is starting with the first occurrence and not properly excluding the second occurrence of "WORD".
It appears as though the trick is matching any character except a "WORD". The example below uses a negative lookahead that is negated by the preceding "." (any), but I am not sure how to combine a positive "any" or newline set with a negative word. Any help would be greatly appreciated.
Here is an example C# program I am running in LINQPad.
void Main() {
var text =
@"WORD
[asdf] ---
123/\*&
WORD
[asdf] ---
123/\*&
ENDWORD
[asdf] ---
123/\*&";
var pattern = $"(WORD).|\\n\\b(?!WORD)\\b.|\\n*(ENDWORD)";
Regex rgx = new Regex(pattern);
foreach (Match match in rgx.Matches(text))
{
match.Dump();
}
}
Another way of stating the problem would be to start from the "ENDWORD" (capture it), backtrack, ignoring all characters until you find the first occurrence of "WORD", and capture it as well. Edit: I modified the example to clarify that the "ENDWORD" is not the end of the string.
Upvotes: 2
Views: 165
Reputation: 3401
That would be a non-greedy regex that searches from the end of the input:
EDIT: I forgot that '.' matches every character except a line feed, so the pattern has to allow '\n' explicitly.
"WORD(\n|.)+?ENDTHING"
with the RightToLeft option:
Regex.Matches(input, pattern, RegexOptions.RightToLeft)
I tested it with your input text on https://rextester.com/tester
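A minimal sketch of how the right-to-left search behaves, in case it helps. The class name RightToLeftSketch, the helper LastBlock, and its endMarker parameter are hypothetical names introduced for illustration; the default marker "ENDWORD" is taken from the question, so pass "ENDTHING" if that is what your input contains.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftSketch
{
    // Hypothetical helper: returns the text from the WORD closest to the
    // last end marker, up to and including that marker.
    public static string LastBlock(string input, string endMarker = "ENDWORD")
    {
        // (\n|.) matches any character including a line feed; +? is lazy.
        string pattern = "WORD(\\n|.)+?" + Regex.Escape(endMarker);

        // With RightToLeft the engine starts at the end of the input, so the
        // lazy quantifier grows leftward and stops at the nearest WORD.
        Match m = Regex.Match(input, pattern, RegexOptions.RightToLeft);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Because the scan runs backward, an earlier WORD further up in the text is never reached once the nearest one has matched. Calling Regex.Matches instead of Regex.Match would keep searching leftward for further end markers.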
Upvotes: 1
Reputation: 55
After looking at this further, there appears to be a simpler solution. I may have caused some confusion in my question since there is no relationship between WORD and ENDWORD other than their positions. Here is the simplified pattern and example.
void Main()
{
var text =
@"WORD
[asdf] ---
123/\*&
WORD
[asdf] ---
123/\*&
ENDTHING
[asdf] ---
123/\*&";
var pattern = $"(WORD)(?:(?!WORD\\b).|\\n)*(ENDTHING)";
Regex rgx = new Regex(pattern);
foreach (Match match in rgx.Matches(text))
{
match.Dump();
}
}
Upvotes: 0
Reputation: 49
I like a simple solution that avoids regex altogether:
var result = text.Substring(0, text.LastIndexOf("ENDWORD")).Split(new[] {"WORD"},StringSplitOptions.None);
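The one-liner returns every segment between the "WORD" markers, so the block nearest to "ENDWORD" is the last element of the array. Here is a hedged sketch of picking that element; the class SplitSketch and the method BodyBeforeEnd are hypothetical names. It also guards against a missing marker, since LastIndexOf returns -1 in that case and Substring would throw.

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper: returns the text between the last WORD and the
    // last ENDWORD, or an empty string when either marker is missing.
    public static string BodyBeforeEnd(string text)
    {
        int end = text.LastIndexOf("ENDWORD", StringComparison.Ordinal);
        if (end < 0) return string.Empty; // no end marker found

        string[] parts = text.Substring(0, end)
                             .Split(new[] { "WORD" }, StringSplitOptions.None);

        // A single element means no WORD preceded the end marker.
        return parts.Length > 1 ? parts[parts.Length - 1] : string.Empty;
    }
}
```

This is plain substring matching, so a "WORD" embedded in a longer token would also act as a separator, which the word-boundary regex versions avoid.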
Upvotes: 0
Reputation: 163362
For your example data, you could first match WORD preceded by a space or tab. Then repeat matching the lines that do not contain WORD until you encounter a line that contains ENDWORD preceded by a space or tab.
To check if the line does not contain WORD you could use a negative lookahead.
[ \t]WORD\b.*(?:\r?\n(?!.*[ \t](?:END)?WORD\b).*)*\r?\n[ \t]+ENDWORD\b
Explanation
[ \t] - Match a space or tab
WORD\b - Match WORD and a word boundary
.* - Match any char 0+ times except a newline
(?: - Non-capturing group
\r?\n(?!.*[ \t](?:END)?WORD\b) - Match a newline and check that the line that follows does not contain an optional END followed by WORD
.* - If that is the case, then match the whole line
)* - Close the non-capturing group and repeat it 0+ times
\r?\n[ \t]+ENDWORD\b - Match a newline, 1+ spaces or tabs and ENDWORD with a word boundary
For example:
var pattern = @"[ \t]WORD\b.*(?:\r?\n(?!.*[ \t](?:END)?WORD\b).*)*\r?\n[ \t]+ENDWORD\b";
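A short sketch of running this pattern, under the assumption that every line of the input is indented by at least one space or tab, because the pattern requires whitespace before WORD and ENDWORD. The class LineBasedSketch and the method FindBlock are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBasedSketch
{
    // Same pattern as above, stored as a verbatim string.
    public const string Pattern =
        @"[ \t]WORD\b.*(?:\r?\n(?!.*[ \t](?:END)?WORD\b).*)*\r?\n[ \t]+ENDWORD\b";

    // Hypothetical helper: returns the first matching block, or an empty
    // string when the input contains no indented WORD ... ENDWORD block.
    public static string FindBlock(string text)
    {
        Match m = Regex.Match(text, Pattern);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Note that the match includes the single space or tab in front of WORD. Input without any indentation produces no match at all, so trim the leading [ \t] parts from the pattern if your data is flush left.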
Upvotes: 1