Yaakov Ellis

Reputation: 41500

RegEx to match a pattern, as long as it is not preceded by a different pattern

I need a regex that is to be used for text substitution. Example: text to be matched is ABC (which could be surrounded by square brackets), substitution text is DEF. This is basic enough. The complication is that I don't want to match the ABC text when it is preceded by the pattern \[[\d ]+\]\. - in other words, when it is preceded by a word or set of words in brackets, followed by a period.

Here are some examples of source text to be matched, and the result, after the regex substitution would be made:

1. [xxx xxx].[ABC] > [xxx xxx].[ABC] (does not match - first part fits the pattern)
2. [xxx xxx].ABC   > [xxx xxx].ABC   (does not match - first part fits the pattern)
3. [xxx.ABC        > [xxx.DEF        (matches - first part has no closing bracket)
4. [ABC]           > [DEF]           (matches - no first part)
5. ABC             > DEF             (matches - no first part)
6. [xxx][ABC]      > [xxx][DEF]      (matches - no period in between)
7. [xxx]. [ABC]    > [xxx]. [DEF]    (matches - space in between)

What it comes down to is: how can I specify the preceding pattern that when present as described will prevent a match? What would the pattern be in this case? (C# flavor of regex)

Upvotes: 8

Views: 6331

Answers (1)

Jeremy W. Sherman

Reputation: 36143

You want a negative look-behind expression. These look like (?<!pattern), so:

(?<!\[[\d ]+\]\.)\[?ABC\]?

Note that this does not force a matching pair of square brackets around ABC; it just allows for an optional open bracket before and an optional close bracket after. If you wanted to force a matching pair or none, you'd have to use alternation:

(?<!\[[\d ]+\]\.)(?:ABC|\[ABC\])

This uses non-capturing parentheses to delimit the alternation. If you want to actually capture ABC, you can of course turn that into a capture group.
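As a minimal sketch of the capture-group idea, the alternation can wrap ABC in a named group. The group name `word`, the class name `CaptureSketch`, and the helper `CapturedWord` are assumptions made for illustration; they are not part of the original answer. The sketch relies on .NET allowing the same group name in both alternatives.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureSketch {
  // Same look-behind and alternation as above, but ABC is captured in a
  // named group. "word" is a hypothetical name chosen for this sketch.
  private const string Pattern =
      @"(?<!\[[\d ]+\]\.)(?:(?<word>ABC)|\[(?<word>ABC)\])";

  // Hypothetical helper: returns the captured text, or "" when nothing matches.
  public static string CapturedWord(string text) {
    Match m = Regex.Match(text, Pattern);
    return m.Success ? m.Groups["word"].Value : "";
  }

  public static void Main() {
    foreach (string text in new[] {"ABC", "[ABC]"}) {
      Match m = Regex.Match(text, Pattern);
      // m.Value is the whole match (with brackets, if any);
      // the named group holds only ABC.
      Console.WriteLine("{0}: matched {1}, captured {2}",
                        text, m.Value, m.Groups["word"].Value);
    }
  }
}
```

With a group in place, the whole match can still include the brackets while the group isolates the text between them.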

ETA: The first expression seems to fail on input like [123].[ABC] because it matches ABC], which is not immediately preceded by the prohibited text: the open bracket [ sits in between. Since that open bracket is optional, the regex simply leaves it out of the match. The way around this is to move the optional open bracket [ into the negative look-behind assertion, like so:

(?<!\[[\d ]+\]\.\[?)ABC\]?

An example of what it matches and doesn't:

[123].[ABC]: fail (expected: fail)
[123 456].[ABC]: fail (expected: fail)
[123.ABC: match (expected: match)
    matched: ABC
ABC: match (expected: match)
    matched: ABC
[ABC]: match (expected: match)
    matched: ABC]
[ABC[: match (expected: fail)
    matched: ABC
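Since the question is about substitution, note that this pattern includes the trailing ] in the match, so replacing the whole match with DEF would turn [ABC] into [DEF. A minimal sketch, assuming only ABC itself should be replaced, drops the optional \]? and leaves any brackets untouched. The class name `SubstitutionSketch` and the helper `ReplaceUnlessPrefixed` are hypothetical, and the inputs use digits in place of the question's xxx placeholders because [\d ] matches only digits and spaces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionSketch {
  // Variant of the pattern above (an assumption for this sketch): the
  // optional trailing \]? is dropped so that only ABC is matched and replaced.
  private const string Pattern = @"(?<!\[[\d ]+\]\.\[?)ABC";

  // Hypothetical helper, not from the original answer.
  public static string ReplaceUnlessPrefixed(string text) {
    return Regex.Replace(text, Pattern, "DEF");
  }

  public static void Main() {
    // The seven cases from the question, with digits instead of xxx.
    string[] values = {
      "[123 456].[ABC]", "[123 456].ABC", "[123.ABC", "[ABC]",
      "ABC", "[123][ABC]", "[123]. [ABC]"
    };
    foreach (string text in values) {
      Console.WriteLine("{0} > {1}", text, ReplaceUnlessPrefixed(text));
    }
  }
}
```

The first two inputs should come back unchanged, while the remaining five should have ABC replaced by DEF with their surrounding brackets, period, and space preserved.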

Trying to make the presence of an open bracket [ force a matching close bracket ], as the second pattern intended, is trickier, but this seems to work:

(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))

An example of what it matches and doesn't:

[123].[ABC]: fail (expected: fail)
[123 456].[ABC]: fail (expected: fail)
[123.ABC: match (expected: match)
    matched: ABC
ABC: match (expected: match)
    matched: ABC
[ABC]: match (expected: match)
    matched: ABC]
[ABC[: fail (expected: fail)

The examples were generated using this code:

// Compile and run with: mcs so_regex.cs && mono so_regex.exe
using System;
using System.Text.RegularExpressions;

public class SORegex {
  public static void Main() {
    string[] values = {"[123].[ABC]", "[123 456].[ABC]", "[123.ABC", "ABC", "[ABC]", "[ABC["};
    string[] expected = {"fail", "fail", "match", "match", "match", "fail"};
    string pattern = @"(?<!\[[\d ]+\]\.\[?)ABC\]?";  // Don't force [ to match ].
    //string pattern = @"(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))";  // Force balanced brackets.
    Console.WriteLine("pattern: {0}", pattern);
    int i = 0;
    foreach (string text in values) {
      Match m = Regex.Match(text, pattern);
      bool isMatch = m.Success;
      Console.WriteLine("{0}: {1} (expected: {2})", text, isMatch ? "match" : "fail", expected[i++]);
      if (isMatch) Console.WriteLine("\tmatched: {0}", m.Value);
    }
  }
}

Upvotes: 19
