JaedenRuiner

Reputation: 359

Regex pattern - Negative lookbehind ignored when followed by optional group?

Sample text:

<td>${myvalue.startDt}</td>
<td>Information Added on ${myvalue.createDt}</td>
<script>
  var myvar = {a:"${myvalue.startDt}"} 
</script>
<input type=text value='${myvalue.endDt}' />

So I wrote a simple regex that is supposed to use a negative lookbehind:

(?<!\:)(['"]?\$\{[A-Za-z0-9\.]+Dt\}['"]?)

I need to find all the files that contain the ${...Dt} pattern, but not any occurrences that are preceded by a: or b: etc. The above is not my final version; I reduced it to this because of the insanity around the regex rules. ? is supposed to be a GREEDY match: all the docs say ? matches 0 or 1 times, matching as many as possible. So the above says the second group should ALWAYS include the ' or the " before it.

This rule is completely obliterated by the negative lookbehind. If I remove the ? after the first ['"], it matches only the value='$...' line at the end. But if I include the ?, it ceases to be greedy for the {a:""} line: it drops the leading " and matches the ${..}" string so that it can claim a match.

How do I get the regex to ALWAYS be greedy on the ? for group 2, but NEVER be greedy for the negative lookbehind elimination?

The final regex should match all of the ${..} lines EXCEPT the {a:"${myvalue.startDt}"} line. (And yes, I will need to group the ${} values for replacement patterns later.)

PS: This is NOT a duplicate so please don't flag it as one. Tragically, moderators and advanced users here have the power to misinterpret questions, and community posters like myself have no recourse to prove them wrong or to remove their incorrect duplicate flag.
This is NOT about JavaScript, nor is it about the simple "how to use negative lookbehind". I know how to use it, but an oddity occurred when doing so.
This is about the negative lookbehind not working when you follow it with an optional ?. I need a way to stop that behavior, so the negative lookbehind is ALWAYS obeyed, regardless of whether it "can be ignored" by not matching the following optional ?.

Upvotes: 0

Views: 305

Answers (1)

melpomene

Reputation: 85867

Greediness just affects in which order the alternatives are tried. The key part is "as many as possible" - a greedy quantifier will give back matches and match less if doing so is required to make the whole regex match.
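A minimal Python sketch of that give-back behaviour. The pattern and input here are illustrative only and do not come from the question:

```python
import re

# Greedy `a?` first tries to consume the only "a" in the input.
# The literal `a` that follows then has nothing left to match, so the
# engine backtracks and lets `a?` match zero characters instead.
m = re.fullmatch(r"(a?)a", "a")

group_one = m.group(1)  # what the optional part ended up matching
whole = m.group(0)      # the overall match

print(repr(group_one), repr(whole))
```

The optional group ends up empty even though ? is greedy, because that is the only way for the whole pattern to succeed.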

If your text is

a:"${myvalue.startDt}"

then a regex of ['"]?\$\{[A-Za-z0-9\.]+Dt\}['"]? could match the following parts (because both quotes are optional):

a:"${myvalue.startDt}"
  ^^^^^^^^^^^^^^^^^^^^ | variant 1
  ^^^^^^^^^^^^^^^^^^^  | variant 2
   ^^^^^^^^^^^^^^^^^^^ | variant 3
   ^^^^^^^^^^^^^^^^^^  | variant 4

What actually happens in practice is that variant 1 is chosen. This is because a regex always finds the leftmost match first (ruling out 3 and 4), and the last ? is greedy, so including the " in the match is tried first (ruling out 2).

But if your regex is (?<!\:)(['"]?\$\{[A-Za-z0-9\.]+Dt\}['"]?), then you get the following possibilities:

a:"${myvalue.startDt}"
   ^^^^^^^^^^^^^^^^^^^ | variant 1
   ^^^^^^^^^^^^^^^^^^  | variant 2

The regex match cannot start at " because the preceding character is :, which the negative lookbehind rules out.

So the next position is tried: ['"] does not match $, but ? also allows zero matches of ['"], so that's what happens next. Then \$ matches $ and the rest proceeds as usual.

At the end we select variant 1 because the second ? is greedy and there's nothing else that prevents this match from succeeding.
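The two cases can be reproduced with a short Python sketch. Python's `re` module is an assumption here, since the question does not name a regex engine; any backtracking engine with lookbehind support should behave the same way on this input:

```python
import re

text = 'a:"${myvalue.startDt}"'

# Pattern without the lookbehind: both quotes are optional.
without_lookbehind = r"""['"]?\$\{[A-Za-z0-9.]+Dt\}['"]?"""
# Pattern from the question: the lookbehind only guards the match start.
with_lookbehind = r"""(?<!:)(['"]?\$\{[A-Za-z0-9.]+Dt\}['"]?)"""

# Leftmost match starts at the opening quote and includes both quotes.
first = re.search(without_lookbehind, text).group(0)

# Starting at the opening quote is rejected (it follows ":"), so the
# match starts one character later, at "$", with zero leading quotes.
second = re.search(with_lookbehind, text).group(0)

print(first)
print(second)
```

The lookbehind is obeyed in both runs; it just applies to wherever the match starts, and the optional leading quote lets the match start at `$`.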

What you can use instead of the leading (?<!\:)['"]? is ((?<!:)["']|(?<!["':])): Match either a " or ' not preceded by a : (the quoted case), or match an empty string not preceded by " or ' or : (the non-quoted case).
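A hedged sketch of how the suggested prefix could be combined with the rest of the question's pattern and run against the sample text. Two things here are assumptions rather than part of the answer: the use of Python's `re` module, and wrapping the prefix in a non-capturing group `(?:...)` so that group 1 still captures the whole `${...}` token with its quotes:

```python
import re

# Sample text from the question.
sample = """<td>${myvalue.startDt}</td>
<td>Information Added on ${myvalue.createDt}</td>
<script>
  var myvar = {a:"${myvalue.startDt}"}
</script>
<input type=text value='${myvalue.endDt}' />
"""

pattern = re.compile(
    r"""((?:(?<!:)["']|(?<!["':]))\$\{[A-Za-z0-9.]+Dt\}['"]?)"""
)

# findall returns the contents of group 1 for every match.
matches = pattern.findall(sample)

for m in matches:
    print(m)
```

The `{a:"${myvalue.startDt}"}` line produces no match: starting at the quote fails because it follows `:`, and starting at `$` fails because it follows a quote.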

Upvotes: 1
