JSams

Reputation: 55

Need a C# regex pattern that matches any character except an excluded word

I am trying to create a pattern that starts with a "WORD" and matches all letters, numbers and characters except the original "WORD" until it comes to the "ENDWORD".

In this example, I want it to start at the second occurrence of "WORD" and match up until the "ENDWORD"; however, it is starting with the first occurrence and not properly excluding the second occurrence of "WORD".

It appears as though the trick is matching any character except a "WORD". The example below uses a negative lookahead that is negated by the preceding "." (any), but I am not sure how to combine a positive "any character or newline" set with a negative word. Any help would be greatly appreciated.

Here is an example C# program I am running in LinqPad.

void Main() {

var text =
    @"WORD 
    [asdf] ---
    123/\*&
    WORD
    [asdf] ---
    123/\*&
    ENDWORD
    [asdf] ---
    123/\*&";

var pattern = $"(WORD).|\\n\\b(?!WORD)\\b.|\\n*(ENDWORD)";

Regex rgx = new Regex(pattern);
foreach (Match match in rgx.Matches(text))
{
    match.Dump();
}

}

Another way of stating the problem would be to start from the "ENDWORD" (capture it), backtrack, ignoring all characters until you find the first occurrence of "WORD", and capture it as well. Edit: modified to clarify that the "ENDWORD" is not the end of the string.

Upvotes: 2

Views: 165

Answers (4)

Roubachof

Reputation: 3401

That would be a non-greedy regex matched starting from the end:

EDIT: I forgot that '.' matches every character except a line feed.

"WORD(\n|.)+?ENDTHING"

with RightToLeft option:

Regex.Matches(input, pattern, RegexOptions.RightToLeft)

I tested it with your input text on https://rextester.com/tester
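For reference, here is a minimal sketch of how the right-to-left approach could be wired up as a complete program. It assumes the end marker is ENDWORD, as in the question, rather than ENDTHING; the class name RightToLeftSketch and the helper FindBlock are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftSketch
{
    // Hypothetical helper: returns the text from the WORD nearest to the
    // last ENDWORD through that ENDWORD, or an empty string if none is found.
    public static string FindBlock(string input)
    {
        // With RightToLeft the engine matches ENDWORD first, then the lazy
        // group grows leftwards one character at a time until it meets WORD.
        var match = Regex.Match(input, @"WORD(?:\n|.)+?ENDWORD", RegexOptions.RightToLeft);
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        // Assumed sample input, shortened from the question's text.
        var text = "WORD\n  [asdf] ---\n  WORD\n  123/\\*&\n  ENDWORD\n  [asdf] ---";
        Console.WriteLine(FindBlock(text));
    }
}
```

Because the lazy group expands towards the start of the string, it stops at the closest WORD before ENDWORD, which here is the second one.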

Upvotes: 1

JSams

Reputation: 55

After looking at this further, there appears to be a simpler solution. I may have caused some confusion in my question since there is no relationship between WORD and ENDWORD other than their positions. Here is the simplified pattern and example.

void Main()
{
    var text =
    @"WORD
    [asdf] ---
    123/\*&
    WORD   
    [asdf] ---
    123/\*&
    ENDTHING
    [asdf] ---
    123/\*&";

    var pattern = $"(WORD)(?:(?!WORD\\b).|\\n)*(ENDTHING)";
    Regex rgx = new Regex(pattern);
    foreach (Match match in rgx.Matches(text))
    {
        match.Dump();
    }
}

Upvotes: 0

robert

Reputation: 49

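I prefer a simple solution without regex. The last element of result is the text between the last "WORD" and the last "ENDWORD":

var result = text.Substring(0, text.LastIndexOf("ENDWORD")).Split(new[] { "WORD" }, StringSplitOptions.None);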
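A slightly more defensive sketch of the same idea follows, since Substring throws when LastIndexOf returns -1. The class name SplitSketch and the helper TextBeforeEnd are hypothetical, and returning an empty string for missing markers is an assumption about the desired behaviour.

```csharp
using System;

public static class SplitSketch
{
    // Hypothetical helper: returns the text between the last WORD and the
    // last ENDWORD, or an empty string if either marker is missing.
    public static string TextBeforeEnd(string text)
    {
        int end = text.LastIndexOf("ENDWORD", StringComparison.Ordinal);
        if (end < 0)
        {
            return string.Empty; // no end marker, so Substring would throw
        }

        string[] parts = text.Substring(0, end)
                             .Split(new[] { "WORD" }, StringSplitOptions.None);

        // When WORD never occurs, Split returns the input as a single element.
        return parts.Length > 1 ? parts[parts.Length - 1] : string.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(TextBeforeEnd("WORD a WORD b ENDWORD c"));
    }
}
```

Note that this splits on every occurrence of the characters WORD, including ones embedded in longer words, so it only suits input where the marker cannot appear inside other text.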

Upvotes: 0

The fourth bird

Reputation: 163362

For your example data, you could first match WORD preceded by a space or tab. Then repeat matching the lines that do not contain WORD until you encounter a line that contains ENDWORD preceded by a space or tab.

To check if the line does not contain WORD you could use a negative lookahead.

[ \t]WORD\b.*(?:\r?\n(?!.*[ \t](?:END)?WORD\b).*)*\r?\n[ \t]+ENDWORD\b

Explanation

  • [ \t] Match a space or tab
  • WORD\b Match WORD and word boundary
  • .* Match any char 0+ times except a newline
  • (?: Non capturing group
    • \r?\n(?!.*[ \t](?:END)?WORD\b) Match a newline, then assert that the next line does not contain a space or tab followed by WORD or ENDWORD
    • .* If that is the case, then match the whole line
  • )* Close non capturing group and repeat 0+ times
  • \r?\n[ \t]+ENDWORD\b Match a newline, 1+ spaces or tabs and ENDWORD with word boundary


For example:

var pattern = @"[ \t]WORD\b.*(?:\r?\n(?!.*[ \t](?:END)?WORD\b).*)*\r?\n[ \t]+ENDWORD\b";
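Put together as a runnable program, this might look like the sketch below. The class name LineWiseSketch, the helper FindBlock, and the shortened sample text are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineWiseSketch
{
    // The pattern from above: WORD after a space or tab, then whole lines
    // that contain neither WORD nor ENDWORD, then the ENDWORD line.
    private const string Pattern =
        @"[ \t]WORD\b.*(?:\r?\n(?!.*[ \t](?:END)?WORD\b).*)*\r?\n[ \t]+ENDWORD\b";

    // Hypothetical helper: returns the matched block, or an empty string.
    public static string FindBlock(string input)
    {
        var match = Regex.Match(input, Pattern);
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        // Assumed sample input; every marker is indented by one space.
        var text = " WORD\n a\n WORD\n b\n ENDWORD\n c";
        Console.WriteLine(FindBlock(text));
    }
}
```

The match includes the space or tab in front of WORD, so you may want to trim the result or wrap the part you need in a capture group.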

Upvotes: 1
