Reputation: 14898
Given the following regular expression and subject text, why is the negative lookahead only applying to the last character of the named capture group URL?
// Regex
(?<URL>(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*)(?!'|"|(</a))
// Subject text
<p><a href="http://example.com">http://example.com</a> and http://example.com</p>
This regex has a negative lookahead, (?!'|"|(</a)), which is an attempt to avoid matching URLs that are inside an <a> tag. It does this by checking whether the URL is followed by a quote (' or ") or a closing </a tag.
I'm getting the following results:
http://example.co
http://example.co
http://example.com
I expected the negative lookahead to apply to the whole capture group, not just its last character. Is this possible? What am I doing wrong? I expected only the last instance of http://example.com to be captured.
Upvotes: 3
Views: 1237
Reputation: 33928
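Because when the negative lookahead fails, the quantifiers (and anything else that can) backtrack until the overall pattern finds a match. Here the engine gives back the final m, so the URL group ends at http://example.co; the next character is then m rather than a quote or </a, and the lookahead succeeds.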
You can force an expression not to backtrack by using an atomic group, (?>expression):
(?<URL>(?>(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*))(?!'|"|(</a))
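To see the difference side by side, here is a rough sketch in Python that runs both patterns against the subject text from the question. A few assumptions: Python spells named groups (?P<name>...), atomic groups are only available in the re module from Python 3.11, and on older versions the sketch falls back to the common lookahead-plus-backreference emulation, which is a different construct with the same no-backtracking effect. The names URL_BODY, LOOKAHEAD and urls are made up for this example.

```python
import re

SUBJECT = ('<p><a href="http://example.com">http://example.com</a>'
           ' and http://example.com</p>')

# Body of the URL group from the question, using Python's (?P<name>...) syntax.
URL_BODY = r"(?P<Protocol>\w+):\/\/(?P<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*"
# The negative lookahead from the question, unchanged.
LOOKAHEAD = r"""(?!'|"|(</a))"""

# Original pattern: when the lookahead fails, the quantifiers may backtrack.
backtracking = re.compile("(?P<URL>" + URL_BODY + ")" + LOOKAHEAD)

try:
    # Atomic group (?>...): needs Python 3.11 or newer.
    atomic = re.compile("(?P<URL>(?>" + URL_BODY + "))" + LOOKAHEAD)
except re.error:
    # Fallback for older versions (an emulation, not a real atomic group):
    # capture inside a lookahead, then consume the capture via a backreference.
    atomic = re.compile("(?=(?P<URL>" + URL_BODY + "))(?P=URL)" + LOOKAHEAD)


def urls(pattern):
    """Return the text captured by the URL group for every match in SUBJECT."""
    return [m.group("URL") for m in pattern.finditer(SUBJECT)]


print(urls(backtracking))  # ['http://example.co', 'http://example.co', 'http://example.com']
print(urls(atomic))        # ['http://example.com']
```

With the atomic version the engine still retries from later start positions (ttp://..., tp://..., and so on), but each attempt ends at the same place and is rejected by the same lookahead, so only the last URL survives.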
Upvotes: 3