Matt

Reputation: 14551

Problem with positive lookbehind and repeating pattern

Consider the following string:

ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)

I want to match every substring that starts after ab( and ends with z, so I wrote the following regular expression:

(?<=a.*?\().*?z

According to RegexBuddy, this does the following:

Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=a.*?\()»
   Match the character “a” literally «a»
   Match any single character that is not a line break character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “(” literally «\(»
Match any single character that is not a line break character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “z” literally «z»

RegexBuddy finds three matches, but the middle one is wrong: it should be fg).xz. What am I doing wrong?

Upvotes: 3

Views: 495

Answers (2)

Tim Pietzcker

Reputation: 336478

The regex is working as designed :)

In the second match, the lookbehind expression matches ab(cd.xz) e(. The lookbehind match is always attempted from the start of the string onward (moving ahead if necessary), so the .*? matches more than you think. It is not (as one might expect) actually performed backwards from the current position.

So in the third match, the lookbehind even matches ab(cd.xz) e(ab(fg).xz)) ab(. It only appears to work correctly because the actual match happens to start after another ab(...
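To see why the position right after e( qualifies, here is a minimal sketch in Python. Python's built-in re module does not accept variable-width lookbehind, so the lookbehind is emulated by hand: it holds at a position if some substring ending exactly there matches a.*?\(. The helper names lookbehind_holds and find_matches are made up for this illustration and are not part of any library or of RegexBuddy.

```python
import re

text = "ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)"

def lookbehind_holds(prefix):
    # Emulates (?<=a.*?\(): is there an "a...(" that ends exactly
    # at the end of the text to the left of the current position?
    return re.search(r"a.*?\(\Z", prefix) is not None

def find_matches(s):
    tail = re.compile(r".*?z")
    results = []
    pos = 0
    while pos < len(s):
        if lookbehind_holds(s[:pos]):
            m = tail.match(s, pos)
            if m:
                # Record where the match starts and what it contains.
                results.append((pos, m.group()))
                pos = m.end()
                continue
        pos += 1
    return results

found = find_matches(text)
print(found)  # [(3, 'cd.xz'), (12, 'ab(fg).xz'), (27, 'hi.xz')]
```

The match at offset 12 starts right after e(, because the a at the very beginning of the string, followed by anything, followed by that (, is enough to satisfy the lookbehind.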

Solution: Be more specific about what you allow to match. I suggest taking parentheses out of the allowed characters:

(?<=a[^()]*\().*?z
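If your engine does not support variable-width lookbehind (Python's built-in re module, for example, rejects it), the same restriction can be approximated with a different technique: a capturing group that consumes the a...( prefix instead of asserting it. This is only a sketch of that substitute, not the lookbehind itself; because the prefix is consumed, results can differ from the lookbehind version when matches would overlap.

```python
import re

text = "ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)"

# Consume "a", then any characters that are not parentheses, then "(",
# and capture everything after that up to the first "z".
pattern = re.compile(r"a[^()]*\((.*?z)")

captured = pattern.findall(text)
print(captured)  # ['cd.xz', 'fg).xz', 'hi.xz']
```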

Upvotes: 4

Alex Aza

Reputation: 78537

If your requirement is "starts after ab( and ends with z", then the expression should be:

(?<=ab\().*?z
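Because this lookbehind has a fixed width, it also runs in engines that reject variable-width lookbehind. A quick check with Python's built-in re module, using the string from the question:

```python
import re

text = "ab(cd.xz) e(ab(fg).xz)) ab(hi.xz)"

# The fixed-width lookbehind requires the literal "ab(" immediately
# before the match; the lazy .*? then stops at the first "z".
matches = re.findall(r"(?<=ab\().*?z", text)
print(matches)  # ['cd.xz', 'fg).xz', 'hi.xz']
```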

If you need to match a*(*z and capture only the *z part, then this expression will work:

(?<=a[^(]*\().*?z

Upvotes: 0
