Algo

Reputation: 880

Why is this regular expression matching giving this result?

With the + metacharacter, the preceding pattern must appear at least once. While trying to match a[ab]+ against the string abbaaabbbbaaaaa using Python's re.findall(), I expected it to return all possible matches starting from the first letter a, as in ['ab', 'abb', 'abba', 'abbaaa', ... etc.], up to the whole string (which is also a match). Furthermore, I think this applies to every a in the string, not only the first one, so I would expect even more matches than that.

This is the code I used:

import re

string = 'abbaaabbbbaaaaa'
matches = re.findall('a[ab]+', string)
for match in matches:
    print(match)

However, the result is only abbaaabbbbaaaaa (the whole string). So what am I misunderstanding?

Upvotes: 2

Views: 276

Answers (3)

Tim Pietzcker

Reputation: 336448

Regular expressions only find non-overlapping matches (unless you're using special tricks like positive lookahead assertions with capturing groups).

Furthermore, your + quantifier is greedy by default, matching as many characters as possible. If you add a ? to it (a[ab]+?), it becomes lazy, so it stops at the first possible point. That gives you a list of non-overlapping matches, which still isn't what you're expecting:

['ab', 'aa', 'ab', 'aa', 'aa']
# as in ABbAAABbbbAAAAa
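
To see the greedy and lazy behaviour side by side, here is a minimal sketch that reuses the string from the question (the variable names greedy and lazy are only illustrative):

```python
import re

string = 'abbaaabbbbaaaaa'

# Greedy: + takes as many characters as it can, so one match covers everything.
greedy = re.findall(r'a[ab]+', string)

# Lazy: +? stops as soon as a single character has satisfied [ab].
lazy = re.findall(r'a[ab]+?', string)

print(greedy)  # ['abbaaabbbbaaaaa']
print(lazy)    # ['ab', 'aa', 'ab', 'aa', 'aa']
```

The final a is left over in the lazy case because nothing follows it for [ab] to match.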

If you do

matches = re.findall('(?=(a[ab]+))', string)

you get all the matches from each possible starting point in the string:

['abbaaabbbbaaaaa',
 'aaabbbbaaaaa',
 'aabbbbaaaaa',
 'abbbbaaaaa',
 'aaaaa',
 'aaaa',
 'aaa',
 'aa']
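
If it helps to see where each of those matches begins, the same lookahead can be run through re.finditer(), which exposes the start position. This is only a sketch; the name overlapping is made up for the example:

```python
import re

string = 'abbaaabbbbaaaaa'

# The lookahead itself matches an empty string, so the scan advances one
# character at a time; group(1) holds what the lookahead saw from that position.
overlapping = [
    (m.start(), m.group(1))
    for m in re.finditer(r'(?=(a[ab]+))', string)
]

for start, text in overlapping:
    print(start, text)
```

Every a except the last one shows up as a start position; the last a has no following character, so a[ab]+ cannot match there.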

By applying the regex again to progressively shortened versions of each of these submatches (dropping characters from the end), you will then get all the possible matches (which are quite numerous).
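
One way to enumerate every possible match is sketched below. Note that it swaps in a plain brute-force loop over all (start, end) pairs rather than reapplying the regex to the submatches, and the names pattern, all_spans and distinct are assumptions made for the example:

```python
import re

string = 'abbaaabbbbaaaaa'
pattern = re.compile(r'a[ab]+')

# Try every (start, end) pair and keep the slices that the whole pattern matches.
all_spans = [
    (start, end)
    for start in range(len(string))
    for end in range(start + 2, len(string) + 1)  # a match needs at least 2 characters
    if pattern.fullmatch(string, start, end)
]

# The same text can occur at several positions, so also collect the distinct strings.
distinct = sorted({string[s:e] for s, e in all_spans}, key=lambda t: (len(t), t))

print(len(all_spans))  # 54 positional matches for this string
print(len(distinct))
```

This is quadratic in the length of the string, which is fine for a short input like this one but would get slow on long text.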

Upvotes: 3

elixenide

Reputation: 44851

a[ab]+ will match a single string (assuming it matches at all). The entire string abbaaabbbbaaaaa matches that regex, so you get one match: the entire string. It does not give you every little piece that might match.

Put differently, each match of a and [ab] "consumes" a character. That is, the matching character is "used up," and the program moves to the next character. In general, that's what you want: you want to see if the whole string matches, or how much of it matches, as opposed to finding all the bits and pieces that make up a bigger match.
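
The "consuming" behaviour can be checked with a small sketch like this one, which uses a compiled pattern so the search can be restarted from a given position (the names first and rest are illustrative):

```python
import re

string = 'abbaaabbbbaaaaa'
pattern = re.compile(r'a[ab]+')

first = pattern.search(string)
print(first.span())  # (0, 15): the match used up all 15 characters

# The scan for further matches resumes at first.end(), where nothing is left.
rest = pattern.search(string, first.end())
print(rest)  # None
```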

Upvotes: 3

user1919238

Reputation:

Brackets are a character class, meaning match any one of these characters.

Therefore, [ab]+ matches one or more characters that are either a or b in a row. Your pattern will gobble up the whole string with a single match.

What you might want is:

re.findall('a(?:ab)+', string)

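Note that (?:...) is a non-capturing group. It matches the same text as (...) would in this pattern, but it doesn't save the subgroups (which you don't need). With re.findall() the difference also shows in the result: if the pattern contains a capturing group, findall returns the group's contents instead of the whole match.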
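
As a rough illustration of how the two kinds of group behave with re.findall() on the question's string (the variable names are made up for this sketch):

```python
import re

string = 'abbaaabbbbaaaaa'

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'a(?:ab)+', string)

# Capturing group: findall returns what the group captured, not the whole match.
capturing = re.findall(r'a(ab)+', string)

print(non_capturing)  # ['aab']
print(capturing)      # ['ab']
```

On this particular string the pattern finds only one match, the aab starting at index 4, since that is the only place where an a is directly followed by ab.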

Upvotes: 0
