hampadampadoo

Reputation: 143

lookahead in the middle of regex doesn't match

I have a string $s1 = "a_b"; and I want to match this string but only capture the letters. I tried to use a lookahead:

if($s1 =~ /([a-z])(?=_)([a-z])/){print "Captured: $1, $2\n";}

but this does not seem to match my string. I solved the original problem by using (?:_) instead, but I am curious why my original attempt did not work. To my understanding, a lookahead matches but does not capture, so what did I do wrong?

Upvotes: 2

Views: 1875

Answers (1)

revo

Reputation: 48711

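A lookahead checks the positions immediately ahead, and if the assertion succeeds the engine goes back to where the previous match ended (right after a) and continues matching from there. Because the lookahead consumes nothing, the second ([a-z]) is then tried against _ and fails. Your regex works only if you put a literal _ right after the positive lookahead: ([a-z])(?=_)_([a-z])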
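To make the zero-width behavior concrete, here is a minimal sketch that reports where the overall match ends for three patterns. The helper name match_end is made up for this illustration; it relies only on Perl's built-in @+ array, whose first element holds the end offset of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset at which the overall match ends,
# or -1 if the pattern does not match at all.
sub match_end {
    my ($str, $re) = @_;
    return $str =~ $re ? $+[0] : -1;
}

my $s1 = "a_b";

# Lookahead only: "_" is checked but not consumed, so the match ends after "a".
print "a(?=_)               ends at ", match_end($s1, qr/a(?=_)/), "\n";

# Original pattern: after the lookahead, [a-z] is tried against "_" and fails.
print "([a-z])(?=_)([a-z])  ends at ", match_end($s1, qr/([a-z])(?=_)([a-z])/), "\n";

# Consuming the underscore after the lookahead lets the match proceed to "b".
print "([a-z])(?=_)_([a-z]) ends at ", match_end($s1, qr/([a-z])(?=_)_([a-z])/), "\n";
```

The first pattern should end at offset 1, the second should fail (-1), and the third should end at offset 3, which is consistent with the lookahead leaving the position right after a.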

You don't even need a (non-)capturing group around the underscore for this match:

if ($s1 =~ /([a-z])_([a-z])/) { print "Captured: $1, $2\n"; }

Edit

In reply to @Borodin's comment

I think that moving backwards is the same as a backtrack, which is easier to recognize when debugging the whole thing (Perl debug mode):

Matching REx "a(?=_)_b" against "a_b"
.
.
.
   0 <> <a_b>                |   0| 1:EXACT <a>(3)
   1 <a> <_b>                |   0| 3:IFMATCH[0](9)
   1 <a> <_b>                |   1|  5:EXACT <_>(7)
   2 <a_> <b>                |   1|  7:SUCCEED(0)
                             |   1|  subpattern success...
   1 <a> <_b>                |   0| 9:EXACT <_b>(11)
   3 <a_b> <>                |   0| 11:END(0)
Match successful!
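A trace like the one above can be produced with the re pragma's debug mode. The following is a minimal sketch, assuming a reasonably recent Perl; the trace is written to STDERR, and its exact layout and node numbers vary between Perl versions, so it may not match the listing above character for character.

```perl
use strict;
use warnings;

my $s1 = "a_b";
my $matched;

{
    # "use re 'debug'" is lexically scoped: only the regex compiled and
    # executed inside this block is traced (output goes to STDERR).
    use re 'debug';
    $matched = $s1 =~ /a(?=_)_b/ ? 1 : 0;
}

print "matched: $matched\n";
```

Redirecting STDERR to a file (for example with 2> trace.txt in a shell) keeps the trace separate from the program's normal output.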

As the debug output above shows, on the fourth line of the trace (the third step) the engine consumes the characters a_ while inside the lookahead assertion. After the positive lookahead succeeds, we see a backtrack happen: the engine steps back over the whole sub-pattern and resumes at the position right after a.

At line 5, the engine has consumed only one character: a. Regex101 debugger:

[Image: Regex101 debugger output]

How I interpret this backtrack is clearer in the following illustration (thanks to @JDB, whose style of representation I borrowed):

a(?=_)_b
*
|\
| \
|  : a (match)
|  * (?=_)
|  |↖
|  | ↖
|  |↘ ↖
|  | ↘ ↖
|  |  ↘ ↖
|  |   : _ (match)
|  |     ^ SUBPATTERN SUCCESS (OP_ASSERT :=> MATCH_MATCH)
|  * _b
|  |\
|  | \
|  |  : _ (match)
|  |  : b (match)
|  | /
|  |/
| /
|/
MATCHED

By this I mean that if the lookahead assertion succeeds, then, since parts of the input string have already been extracted, the engine goes back up to the previous match offset (eptr, the pointer into the subject, is not changed, but the offset is), resets the consumed characters, and tries to continue matching from there. I call this a backtrack. Below is a visual representation of the steps taken by the engine, produced with Regexp::Debugger:

[Image: Regexp::Debugger visualization of the matching steps]

So I see it as a backtrack, or a kind of one. If I'm wrong about any of this, I'd welcome corrections with open arms.

Upvotes: 6
