Reputation: 513

Difference between two regular expressions: [abc]+ and ([abc])+

In [29]: re.findall("([abc])+","abc")
Out[29]: ['c']

In [30]: re.findall("[abc]+","abc")
Out[30]: ['abc']

I'm confused by the grouped one. How does the group make a difference?

Upvotes: 8

Answers (5)

Josh S.

Reputation: 597

Grouping just gives different preference.

([abc])+ => Match one character from the set and capture it as a group; the + then repeats that group one or more times. This breaks the regex up into two stages.

The ungrouped one, by contrast, is treated as a whole.

Upvotes: -3

Alan Moore

Reputation: 75222

There are two things that need to be explained here: the behavior of quantified groups, and the design of the findall() method.

In your first example, [abc] matches the a, which is captured in group #1. Then it matches b and captures it in group #1, overwriting the a. Then again with the c, and that's what's left in group #1 at the end of the match.

But it does match the whole string. If you were using search() or finditer(), you would be able to look at the MatchObject and see that group(0) contains abc and group(1) contains c. But findall() returns strings, not MatchObjects. If there are no groups, it returns a list of the overall matches; if there are groups, the list contains all the captures, but not the overall match.
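A minimal sketch of that difference, using only the standard re module (the variable names are just for illustration):

```python
import re

# search() returns a match object, so both the overall match
# and the final state of group 1 can be inspected.
m = re.search(r"([abc])+", "abc")
overall = m.group(0)      # the whole match
last_capture = m.group(1) # what is left in group 1 after the last iteration

# findall() reports only the group when the pattern has exactly one group.
found = re.findall(r"([abc])+", "abc")

print(overall)       # abc
print(last_capture)  # c
print(found)         # ['c']
```

The same match object comes back from finditer(), one per match, if the input contains several runs.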

So both of your regexes are matching the whole string, but the first one is also capturing and discarding each character individually (which is kinda pointless). It's only the unexpected behavior of findall() that makes it look like you're getting different results.

Upvotes: 8

C S

Reputation: 1525

Here's the way I would think about it. ([abc])+ is attempting to repeat a captured group. When you use "+" after the capture group, it doesn't mean you are going to get multiple captured groups. What ends up happening, at least in Python's regex engine and most other implementations, is that each iteration of the "+" overwrites the capture group, so at the end it contains only the last match.

If you want to capture a repeated expression, you need to reverse the ordering of "(...)" and "+", e.g. instead of ([abc])+ use ([abc]+).
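As a quick sketch of the two orderings side by side (the sample string here is made up for illustration):

```python
import re

text = "abc xx cab"  # hypothetical input with two runs of a/b/c

# Quantifier outside the group: each run leaves only its last character.
repeated_group = re.findall(r"([abc])+", text)

# Quantifier inside the group: each run is captured whole.
grouped_repeat = re.findall(r"([abc]+)", text)

print(repeated_group)  # ['c', 'b']
print(grouped_repeat)  # ['abc', 'cab']
```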

Upvotes: 3