user10400458

Regex lookahead with parentheses

What exactly is re.findall('(?=(b))','bbbb') doing? It returns ['b', 'b', 'b', 'b'], but I expected ['b', 'b', 'b'], since it should only return a 'b' if it sees another 'b' ahead?

Thanks!

Edit: It seems that re.findall('b(?=(b))','bbbb') returns ['b', 'b', 'b'] like I would expect, but I am still confused as to what re.findall('(?=(b))','bbbb') does.

Edit 2: Got it! Thank you for the responses.

Upvotes: 3

Views: 1730

Answers (3)

The fourth bird

Reputation: 163517

A positive lookahead (?=...) asserts a position without consuming any characters. Here the assertion succeeds 4 times, because there are 4 positions that are directly followed by a b. Inside the assertion you capture that b in a capturing group (b), and findall returns the contents of the group.

If you want to get b three times and you don't need the group, you can match b and add a lookahead asserting that what is directly to the right is a b:

import re
print(re.findall(r'b(?=b)', 'bbbb'))  # ['b', 'b', 'b']
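
To see why this variant stops at three results, it may help to look at the span of each match. This is only a small sketch using re.finditer; the variable name spans is mine, not from the question:

```python
import re

# Each match consumes one 'b'; the lookahead only checks the next character
# without consuming it, so the last 'b' (nothing after it) never matches.
spans = [m.span() for m in re.finditer(r'b(?=b)', 'bbbb')]
print(spans)  # [(0, 1), (1, 2), (2, 3)]
```

The b at index 3 is missing from the list because there is no b to its right.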

Upvotes: 1

CertainPerformance

Reputation: 371049

You have a zero-length match there, and you have a capturing group. When the regular expression for re.findall has a capturing group, the resulting list will be what's been captured in those capturing groups (if anything).
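
As a rough illustration of that rule, here is how the shape of the re.findall result changes with the number of capturing groups. The variable names are just for this example:

```python
import re

text = 'bbbb'

# No capturing group: findall returns the full text of each match.
no_group = re.findall(r'b(?=b)', text)

# One capturing group: findall returns what the group captured,
# even though the overall match here is zero-length.
one_group = re.findall(r'(?=(b))', text)

# Two capturing groups: findall returns a tuple per match.
two_groups = re.findall(r'(b)(?=(b))', text)

print(no_group)    # ['b', 'b', 'b']
print(one_group)   # ['b', 'b', 'b', 'b']
print(two_groups)  # [('b', 'b'), ('b', 'b'), ('b', 'b')]
```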

Four positions are matched by your regex: the start of the string (just before the first b), before the second b, before the third b, and before the fourth b. The only position that fails is the end of the string, where no b follows. Here's a diagram, where | represents the position being tested (spaces added for illustration):

 b b b b
|         captures the next b, passes

 b b b b
  |       captures the next b, passes

 b b b b
    |     captures the next b, passes

 b b b b
      |   captures the next b, passes

 b b b b
        | lookahead fails, match fails
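
The diagram above can be reproduced in code by trying the pattern at each position by hand. This is a sketch that uses the pos argument of a compiled pattern's match method to simulate what the regex engine does; the loop and the results list are my own scaffolding:

```python
import re

pattern = re.compile(r'(?=(b))')
text = 'bbbb'

# Try the pattern at every position, including the very end of the string.
results = []
for pos in range(len(text) + 1):
    m = pattern.match(text, pos)
    # m is None when the lookahead fails at this position.
    results.append((pos, m.group(1) if m else None))

for pos, captured in results:
    print(pos, captured)
# 0 b
# 1 b
# 2 b
# 3 b
# 4 None
```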

If you don't want a capturing group and only want to match the zero-length positions, use (?: instead of ( to make the group non-capturing:

(?=(?:b))

(though the resulting list will be composed of empty strings and won't be very useful)
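
For comparison, a quick sketch of both forms side by side. The last line is an extra suggestion of mine, not part of the original question: if the positions themselves are what you are after, re.finditer exposes them through m.start():

```python
import re

captured = re.findall(r'(?=(b))', 'bbbb')          # group contents
positions_only = re.findall(r'(?=(?:b))', 'bbbb')  # zero-length matches

print(captured)        # ['b', 'b', 'b', 'b']
print(positions_only)  # ['', '', '', '']

# The match objects still carry the index of each zero-length match.
starts = [m.start() for m in re.finditer(r'(?=(?:b))', 'bbbb')]
print(starts)          # [0, 1, 2, 3]
```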

Upvotes: 2

Jean-François Fabre

Reputation: 140276

The problem is that the capturing group is inside the lookahead.

To do what you want you have to capture the letter, then use a lookahead that doesn't capture:

re.findall('(b)(?=b)','bbbb')

result:

['b', 'b', 'b']

Upvotes: 2
