Why do substrings prevent match with negative lookahead?

Question

Consider the following test data:

x.foo,x.bar
y.foo,y.bar
yy.foo,yy.bar

x.foo,y.bar
y.foo,x.bar
yy.foo,x.bar
x.foo,yy.bar
yy.foo,y.bar
y.foo,yy.bar

I'm attempting to write a regular expression where the string before .foo and the string before .bar are different from each other. The first three items should not match. The other six should.

This mostly works:

^(.+?)\.foo,(?!\1)(.+?)\.bar$

However, it fails on the last one (y.foo,yy.bar): y is captured in match group 1, and because yy begins with y, the lookahead (?!\1) rejects yy for match group 2.

Interactive: https://regex101.com/r/Pv5062/1

How can I modify the negative lookahead pattern such that the last item matches as well?

Wiktor Stribiżew · Accepted Answer

Inline backreferences do not store any context information; they only keep the captured text. You need to specify the context yourself.

You may add a dot after \1:

^(.+?)\.foo,(?!\1\.)(.+?)\.bar$
                 ^^
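
For a quick check outside regex101, here is a minimal Python sketch using the standard re module, run over the nine test lines from the question. The names FIXED and LINES are only illustrative and are not part of the original post.

```python
import re

# Pattern from the answer: the lookahead now requires a literal dot after \1.
FIXED = re.compile(r'^(.+?)\.foo,(?!\1\.)(.+?)\.bar$')

# The nine test lines from the question, in their original order.
LINES = [
    "x.foo,x.bar", "y.foo,y.bar", "yy.foo,yy.bar",
    "x.foo,y.bar", "y.foo,x.bar", "yy.foo,x.bar",
    "x.foo,yy.bar", "yy.foo,y.bar", "y.foo,yy.bar",
]

matched = [line for line in LINES if FIXED.match(line)]
rejected = [line for line in LINES if not FIXED.match(line)]

print("matched: ", matched)
print("rejected:", rejected)
```

The first three lines should end up in rejected and the remaining six, including y.foo,yy.bar, in matched.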

Or even repeat the part that follows the second (.+?):

^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$

Or, if the bar part cannot contain ., you may make it more "generic":

^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$

See the regex demo and another regex demo.
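
The three variants behave the same on the question's data, but they can differ once the prefix itself contains a dot. The sketch below, in Python with the standard re module, illustrates this with a hypothetical input, y.foo,y.z.bar, which is not in the question. The labels in VARIANTS are made up for this example.

```python
import re

# The three patterns from the answer, keyed by an illustrative label.
VARIANTS = {
    "dot only": r'^(.+?)\.foo,(?!\1\.)(.+?)\.bar$',
    "full suffix": r'^(.+?)\.foo,(?!\1\.bar$)(.+?)\.bar$',
    "generic suffix": r'^(.+?)\.foo,(?!\1\.[^.]+$)(.+?)\.bar$',
}

# Hypothetical input (an assumption, not from the question):
# the prefixes "y" and "y.z" differ, but the second starts with "y.".
SAMPLE = "y.foo,y.z.bar"

results = {
    name: re.match(pattern, SAMPLE) is not None
    for name, pattern in VARIANTS.items()
}

for name, ok in results.items():
    print(f"{name}: {'match' if ok else 'no match'}")
```

Here the dot-only lookahead rejects the sample because y. does appear right after the comma, while the two variants that spell out the rest of the string up to the end accept it. Which behavior is wanted depends on whether prefixes may contain dots.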

The point is: your (?!\1) is not "anchored", so it fails the match whenever the text stored in Group 1 appears immediately to the right of the current location, regardless of what follows it. To solve this, you need to provide that context yourself. Since the value matched by .+? can contain virtually anything, all you can rely on is the "hardcoded" bits after the lookahead.
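
To make the "not anchored" point concrete, the lookahead can be modeled as a plain prefix test. This is a simplified Python sketch that only covers a literal context string, not a full regex; the helper name lookahead_blocks is hypothetical.

```python
def lookahead_blocks(captured: str, rest: str, context: str = "") -> bool:
    # Models a negative lookahead on the backreference plus a literal context:
    # the match is blocked when `rest` starts with `captured + context`.
    return rest.startswith(captured + context)


# y.foo,yy.bar: group 1 is "y", the text after the comma is "yy.bar".
blocks_unanchored = lookahead_blocks("y", "yy.bar")       # models (?!\1)
blocks_anchored = lookahead_blocks("y", "yy.bar", ".")    # models (?!\1\.)

# y.foo,y.bar: same prefix on both sides, so it should still be blocked.
blocks_same = lookahead_blocks("y", "y.bar", ".")

print(blocks_unanchored, blocks_anchored, blocks_same)
```

Without the trailing dot, any second prefix that merely begins with the first one is blocked; with it, only an identical prefix is, as long as prefixes contain no dots themselves.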
