Weird behavior of negative look ahead

Question

I have the following string: "text before AB000CD000CD text after". I want to match text from AB to the first occurrence of CD. Inspired by this answer, I created the following regex pattern:

AB((?!CD).)*CD

I checked the result in https://regex101.com/ and the output is:

Full match  12-19   `AB000CD`
Group 1.    16-17   `0`

It looks like it does what I need, but I don't understand why it works. My understanding is that my pattern should match AB first, then any character that is not followed by CD, and then CD itself. Following this logic, the result should include only 00 rather than 000, because the last zero is actually followed by CD. Is my explanation wrong?

Wiktor Stribiżew · Accepted Answer

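AB((?!CD).)*CD matches AB, then zero or more chars, each of which does not start a CD char sequence, and then CD. Your mistake is in saying "that is not followed by CD". Note that the negative lookahead is located before the dot (.), so it is checked at the position before the char is consumed: it tests whether that char and the one after it spell CD, not whether CD comes after the char. The last 0 is followed by CD but does not start CD, so it is matched.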
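The position of the lookahead can be checked by hand. The snippet below is a minimal sketch using Python's re module (an assumption, since the question only mentions regex101); the variable names are illustrative. It runs the pattern on the string from the question and then tests, at each position after AB, whether CD starts there, which is the same test (?!CD) performs before the dot consumes a char.

```python
import re

text = "text before AB000CD000CD text after"

# Pattern from the question, unchanged.
match = re.search(r"AB((?!CD).)*CD", text)
print(match.group(0), match.span())    # full match and its offsets
print(match.group(1), match.span(1))   # last char captured by the repeated group

# Reproduce the lookahead test manually for the four positions after AB.
# (?!CD) fails only where the text at that position starts with CD.
first = text.index("AB") + 2
checks = [(pos, text[pos], text.startswith("CD", pos)) for pos in range(first, first + 4)]
for pos, char, starts_cd in checks:
    verdict = "lookahead fails, repetition stops" if starts_cd else "dot may consume"
    print(pos, repr(char), verdict)
```

The three zeros all pass the test because none of them starts CD; the repetition stops only at the C, where the trailing CD of the pattern then matches.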

Besides, it makes no sense to use the tempered greedy token when the negated part is the same as the trailing boundary; just use a lazy dot matching pattern, AB(.*?)CD. You need the construct when you do not want to match AB (the initial boundary) in between the AB and CD, i.e. AB((?:(?!AB).)*?)CD (it is the most common use case).
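A short comparison may make the difference concrete. This is a sketch in Python's re module (assumed here, not stated in the question); the string "AB111AB222CD" is a made-up input chosen so that a second AB appears before CD.

```python
import re

original = "text before AB000CD000CD text after"

# When the negated part equals the trailing boundary, the lazy dot and the
# tempered token find the same text.
lazy = re.search(r"AB(.*?)CD", original)
tempered_cd = re.search(r"AB((?:(?!CD).)*)CD", original)
print(lazy.group(1), tempered_cd.group(1))

# Hypothetical input with a second AB before the CD.
nested = "AB111AB222CD"

# The lazy dot starts at the first AB and runs over the second one.
lazy_nested = re.search(r"AB(.*?)CD", nested)
print(lazy_nested.group(1))

# Tempering the dot with (?!AB) forbids crossing another AB, so the match
# has to start at the last AB before CD.
tempered_ab = re.search(r"AB((?:(?!AB).)*?)CD", nested)
print(tempered_ab.group(0), tempered_ab.group(1))
```

On the original string both patterns capture the same text, which is why the simpler lazy pattern is enough there.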

See rexegg.com reference about when to use it:

Suppose our boss now tells us that we still want to match up to and including {END}, but that we also need to avoid stepping over a {MID} section, if it exists. Starting with the lazy dot-star version to ensure we match up to the {END} delimiter, we can then temper the dot to ensure it doesn't roll over {MID}:

{START}(?:(?!{MID}).)*?{END}

If more phrases must be avoided, we just add them to our tempered dot:

{START}(?:(?!{MID})(?!{RESTART}).)*?{END}
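The quoted patterns can be tried out as follows. This is a sketch in Python's re module under a few assumptions: the delimiters are escaped with re.escape so the braces are treated literally, and the two sample strings are invented for illustration.

```python
import re

# Escape the literal delimiters so { and } are never read as quantifier syntax.
start, mid, restart, end = (
    re.escape(token) for token in ("{START}", "{MID}", "{RESTART}", "{END}")
)

lazy_dot = re.compile(f"{start}.*?{end}")
tempered_mid = re.compile(f"{start}(?:(?!{mid}).)*?{end}")
tempered_both = re.compile(f"{start}(?:(?!{mid})(?!{restart}).)*?{end}")

# Made-up samples: the first section contains a phrase that must be avoided.
with_mid = "{START} a {MID} b {END} ... {START} c {END}"
with_restart = "{START} x {RESTART} y {END} {START} z {END}"

print(lazy_dot.search(with_mid).group(0))           # rolls over {MID}
print(tempered_mid.search(with_mid).group(0))       # skips the section containing {MID}
print(tempered_mid.search(with_restart).group(0))   # {RESTART} is not yet excluded
print(tempered_both.search(with_restart).group(0))  # both phrases are excluded
```

Each extra negative lookahead placed before the dot adds one more phrase the match may not step over.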

Also, see this thread.
