.net regex lookbehind regex-lookarounds balancing-groups

Reputation: 4030

match regex with variable length look-behind of a word and variable length negative look-behind of another word?

I have a regular expression that captures a pattern A only if the string contains a pattern B somewhere before A.

Let's say, for the sake of simplicity, that A is \b\d{3}\b (i.e. three digits) and B is the word "foo".

Therefore the Regex I have is (?<=\b(?:foo)\b.*?)(?<A>\b\d{3}\b).

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    .*?            # variable length
)
(?<A>\b\d{3}\b)    # pattern A

For example, for the string

"foo text 111, 222 and not bar something 333 but foo 444 and better 555"

it captures

(111, 222, 333, 444, 555)

I got a new requirement: I now have to exclude the captures where a pattern C appears between the most recent B and A. Let's say that C is the word "bar". What I want to build is a regex that expresses

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    ???????????    # anything that does not contain pattern C
)
(?<A>\b\d{3}\b)    # pattern A

So, in the example string I will have to capture

(111, 222, 444, 555)

Of course something like (?<=\b(?:foo)\b.*?)(?<!\b(?:bar)\b.*?)(?<A>\b\d{3}\b)

(?<=               # look-behind
    \b(?:foo)\b    # pattern B
    .*?
)
(?<!               # negative look-behind
    \b(?:bar)\b    # pattern C
    .*?
)
(?<A>\b\d{3}\b)    # pattern A

will not work as it will exclude everything after the first appearance of "bar" and the capture will be

(111, 222)

The regex (?<=\b(?:foo)\b(?!.*?(?:\bbar\b)).*?)(?<A>\b\d{3}\b)

(?<=                     # look-behind
    \b(?:foo)\b          # pattern B
    (?!                  # negative lookahead
        .*?              # variable length
        (?:\bbar\b)      # pattern C
    )
    .*?                  # variable length
)
(?<A>\b\d{3}\b)          # pattern A

will not work either because for the first "foo" in my test string, it will always find the "bar" as a suffix and it will only capture

(444, 555)

So far, using conditional matching and (now) knowing that inside a lookbehind .NET matches and captures from right to left, I was able to create the following regex: (?<=(?(C)(?!)|(?:\bfoo\b))(?:(?<!\bbar)\s|(?<C>\bbar\s)|[^\s])*)(?<A>\b\d{3}\b)

(?<=                     # look-behind
    (?(C)                # if capture group C is not empty
        (?!)             # fail (pattern C was found)
        |                # else
        (?:\bfoo\b)      # pattern B
    )
    (?:
        (?<!\bbar)\s     # space not preceded by pattern C (consume the space)
        |
        (?<C>\bbar\s)    # pattern C followed by space (capture in capture group C)
        |
        [^\s]            # anything but space (just consume)
    )*                   # repeat as needed
)
(?<A>\b\d{3}\b)          # pattern A

which works, but is too complex, as the patterns A, B and C are a lot more complex than the examples I have posted here.

Is it possible to simplify this regex? Maybe using balancing groups?

Upvotes: 3

Answers (3)

Casimir et Hippolyte

Reputation: 89547

You can use a pattern based on the \G anchor that matches the position after the previous match:

(?:\G(?!\A)|\bfoo\b)(?:(?!\b(?:bar|\d{3})\b).)*(\d{3})

details:

(?:
    \G(?!\A) # contiguous to a previous match and not at the start of the string
  |        # OR
    \bfoo\b  # foo: the condition for the first match
)
(?:(?!\b(?:bar|\d{3})\b).)* # all that is not "bar" or a 3 digit number (*)
(\d{3})

(*) Note that if you can use a better subpattern for your real situation (i.e. one that doesn't test each character with a lookahead containing an alternation), don't hesitate to change it. For example, something based on character classes: [^b\d]*(?>(?:\B[b\d]+|b(?!ar\b)|\d(?!\d\d\b))[^b\d]*)*

Another way: since the .NET regex engine is able to store repeated captures, you can write this too:

\bfoo\b(?:(?:(?!\b(?:bar|\d{3})\b).)*(\d{3}))+

But this time, you need to loop over each occurrence of foo to extract results in group 1. It's less handy but the pattern is faster since it doesn't start with an alternation.

Note that if "bar" and "\d{3}" start and end with word characters, you can write the pattern in a more efficient way:

\bfoo(?:\W+(?>(?!bar\b)\w+\W+)*?(\d{3}))+\b

Another way: split your string on "foo" and "bar" (preserving the delimiters) and loop over the parts. When the part is "foo", set a flag to true; when the part is "bar", set it to false; and when it is neither, extract the numbers if the flag is true.
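The split-and-flag idea can be sketched outside of .NET too. Below is a minimal Python version, only as an illustration: the function name extract_after_foo is made up for this example, and the simple "foo", "bar" and \b\d{3}\b patterns stand in for the real B, C and A. It relies on re.split keeping the delimiters when the split pattern contains a capturing group.

```python
import re

def extract_after_foo(text):
    """Collect 3-digit numbers that follow "foo" with no "bar" in between."""
    # The capturing group makes re.split keep "foo" and "bar" in the result.
    parts = re.split(r"\b(foo|bar)\b", text)
    enabled = False          # the flag: True after "foo", False after "bar"
    numbers = []
    for part in parts:
        if part == "foo":
            enabled = True
        elif part == "bar":
            enabled = False
        elif enabled:
            # Ordinary text between delimiters: extract pattern A from it.
            numbers.extend(re.findall(r"\b\d{3}\b", part))
    return numbers

sample = "foo text 111, 222 and not bar something 333 but foo 444 and better 555"
print(extract_after_foo(sample))  # ['111', '222', '444', '555']
```

With more complex B and C patterns, the equality checks against "foo" and "bar" would have to become pattern tests, but the flag logic stays the same.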

Upvotes: 3

Kobi

Reputation: 138007

Since you've asked, it is possible with balancing groups, but probably not needed.

\A                    # Match from the start of the string
(?>                   # Atomic group. no backsies.
    (?<B>(?<-B>)?foo)            # If we see "foo", push it to stack B.
                                 # (?<-B>)? ensures B only has one item - if there are two,
                                 # one is popped.
    |(?<-B>bar)                  # When we see a bar, reset the foo.
    |(?(B)(?<A>\b\d{3}\b)|(?!))  # If foo is set, we are allowed to capture A.
    |.                           # Else, just advance by one character.
)+
\z                    # Match until the end of the string.

If we wanted to be extra clever (which we probably don't), we can combine most branches into the conditional:

\A
(?>
  (?(B)
    (?:(?<A>\b\d{3}\b)|(?<-B>bar))
    | # else
    (?<B>foo)
  )
  |.
)+
\z

Again, it is possible, but balancing groups are not the best option here, mainly because we are not balancing anything, just checking if a flag is set or not.

Upvotes: 2

Kobi

Reputation: 138007

One simple option is very similar to Casimir et Hippolyte's second pattern:

foo(?>(?<A>\b\d{3}\b)|(?!bar).)+

Start with foo
(?>…|(?!bar).)+ - Stop matching if you've seen bar.
(?<A>\b\d{3}\b) and capture all A's that you see along the way.
Atomic group (?>) isn't necessary in this case, backtracking wouldn't mess this up either way.
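
For engines that keep only the last value of a repeated capture group (Python's re module is one of them), the same "start at foo, stop at bar, collect A along the way" idea can be approximated in two stages. This is a sketch of a swapped-in technique, not the pattern above: the names SEGMENT, NUMBER and numbers_in_foo_segments are invented for the example, and it assumes B and C are the simple words used in the question.

```python
import re

# Stage 1: "foo", then every character up to (not including) the next "bar".
# Note that "." does not cross newlines unless re.DOTALL is added.
SEGMENT = re.compile(r"\bfoo\b((?:(?!\bbar\b).)*)")
# Stage 2: pattern A, applied only inside each captured segment.
NUMBER = re.compile(r"\b\d{3}\b")

def numbers_in_foo_segments(text):
    """Return all 3-digit numbers found between a "foo" and the next "bar"."""
    found = []
    for segment in SEGMENT.finditer(text):
        found.extend(NUMBER.findall(segment.group(1)))
    return found

sample = "foo text 111, 222 and not bar something 333 but foo 444 and better 555"
print(numbers_in_foo_segments(sample))  # ['111', '222', '444', '555']
```

A second "foo" inside a segment is simply consumed as ordinary text, so its numbers are not reported twice.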

Similarly, it can be converted to a lookbehind:

(?<=foo(?:(?!bar).)*?)(?<A>\b\d{3}\b)

This has the benefit of matching only the numbers. The lookbehind asserts that there is a foo before A, with no bar in between.

Both of these assume B and C are somewhat simple.

Upvotes: 2
