Reputation: 880
With the metacharacter +, the pattern must appear at least once. While trying to match a[ab]+ against the string abbaaabbbbaaaaa using Python's re.findall(), I expected it to return all the possible matches starting from the first letter a, as in ['ab', 'abb', 'abba', 'abbaaa', ... etc], until reaching the whole string (which is also a match). Furthermore, I think this also applies to every a in the string, not only the first one, so I suppose there should be even more matches than this.
This is the code I used:
import re
string = 'abbaaabbbbaaaaa'
matches = re.findall('a[ab]+', string)
for match in matches:
    print(match)
However, the result is only abbaaabbbbaaaaa (the whole string). So what am I misunderstanding?
Upvotes: 2
Views: 276
Reputation: 336448
re.findall() only returns non-overlapping matches (unless you use special tricks like positive lookahead assertions with capturing groups).
Furthermore, your + quantifier is greedy by default, matching as many characters as possible. If you add a ? to it (a[ab]+?), it becomes lazy, so it stops at the first possible point. That gives you a list of non-overlapping matches, which, however, is not what you're expecting either:
['ab', 'aa', 'ab', 'aa', 'aa']
# as in ABbAAABbbbAAAAa
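To see the greedy and lazy behaviour side by side, here is a small sketch using the string from the question (the variable names greedy and lazy are just for illustration):

```python
import re

string = 'abbaaabbbbaaaaa'

# Greedy: + takes as many characters as it can, so one match covers everything.
greedy = re.findall(r'a[ab]+', string)

# Lazy: +? stops as soon as the pattern is satisfied (an 'a' plus one more character).
lazy = re.findall(r'a[ab]+?', string)

print(greedy)  # ['abbaaabbbbaaaaa']
print(lazy)    # ['ab', 'aa', 'ab', 'aa', 'aa']
```

The trailing a is left out of the lazy result because there is no character after it for [ab]+? to match.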
If you do
matches = re.findall('(?=(a[ab]+))', string)
you get the longest match from each possible starting point in the string:
['abbaaabbbbaaaaa',
'aaabbbbaaaaa',
'aabbbbaaaaa',
'abbbbaaaaa',
'aaaaa',
'aaaa',
'aaa',
'aa']
To get all the possible matches (which are quite numerous), you would then have to apply the regex again to each of these submatches with its last character removed, and so on recursively.
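If you really want every possible match, one way to sketch it is a brute-force check of every substring instead of the recursive approach. This is not something re offers directly; the helper all_matches below is a hypothetical name of my own, and it relies on the compiled pattern's fullmatch() method with its pos and endpos arguments:

```python
import re

string = 'abbaaabbbbaaaaa'
pattern = re.compile(r'a[ab]+')

def all_matches(pattern, text):
    """Return (start, end, substring) for every slice of text the pattern matches in full."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch with pos/endpos tests exactly text[start:end].
            if pattern.fullmatch(text, start, end):
                found.append((start, end, text[start:end]))
    return found

matches = all_matches(pattern, string)
print(len(matches))   # 54
print(matches[:3])    # [(0, 2, 'ab'), (0, 3, 'abb'), (0, 4, 'abba')]
```

This is quadratic in the length of the string (and each check costs time too), so it is only reasonable for short inputs like this one.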
Upvotes: 3
Reputation: 44851
a[ab]+ will match a single string (assuming it matches at all). The entire string abbaaabbbbaaaaa matches that regex, so you get one match: the entire string. It does not give you every little piece that might match.
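You can check this yourself with re.finditer(), which yields match objects so you can look at where each match starts and ends. A minimal sketch with the string from the question (the lazy a[ab]+? variant is only there for comparison):

```python
import re

string = 'abbaaabbbbaaaaa'

# One match, spanning the whole string: nothing is left over for a second match.
spans = [m.span() for m in re.finditer(r'a[ab]+', string)]
print(spans)  # [(0, 15)]

# With a lazy quantifier the matches are shorter, but each search still
# resumes where the previous match ended, so the spans never overlap.
lazy_spans = [m.span() for m in re.finditer(r'a[ab]+?', string)]
print(lazy_spans)  # [(0, 2), (3, 5), (5, 7), (10, 12), (12, 14)]
```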
Put differently, each match of a and [ab] "consumes" a character. That is, the matching character is "used up," and the program moves to the next character. In general, that's what you want: you want to see if the whole string matches, or how much of it matches, as opposed to finding all the bits and pieces that make up a bigger match.
Upvotes: 3
Reputation:
Brackets are a character class, meaning match any one of these characters.
Therefore, [ab]+ matches one or more characters in a row that are each either a or b. Your pattern will gobble up the whole string with a single match.
What you might want is:
re.findall('a(?:ab)+', string)
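Note that (?:...) is a non-capturing group. It groups the same way (...) would in this pattern, but it doesn't save the subgroup (which you don't need). With re.findall() this also changes the result: if the pattern contains a capturing group, findall returns the contents of that group instead of the whole match.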
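A quick sketch of the difference between the two kinds of group when used with re.findall(), on the string from the question (variable names are only illustrative):

```python
import re

string = 'abbaaabbbbaaaaa'

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'a(?:ab)+', string)

# Capturing group: findall returns what the group captured, not the whole match.
capturing = re.findall(r'a(ab)+', string)

print(non_capturing)  # ['aab']
print(capturing)      # ['ab']
```

Both patterns match the same text (the aab at index 4); only what findall hands back differs.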
Upvotes: 0