Reputation:
I have some strings like "aaa bbb ccc", "aaa bbb ccc ddd", "aaa bbb ccc ddd eee", ...
I need a regex that gets rid of aaa bbb and captures everything else.
I'm trying '\w+\s\w+\s(\w+|\s)+', but it's not working.
In [171]: r = re.search('\w+\s\w+\s(\w+|\s)+', 'aaa bbb ccc ddd')
In [172]: r.group(0)
Out[172]: 'aaa bbb ccc ddd'
In [173]: r.group(1)
Out[173]: 'ddd'
I'd expect r.group(1) to return 'ccc ddd'.
Upvotes: 1
Views: 84
Reputation: 23186
The issue here is that you have not told the regular expression that the group should encompass all the repeats of \w+|\s, because your + is outside of the parentheses.
Instead, try:
>>> import re
>>> r = re.search(r'\w+\s\w+\s((?:\w+|\s)+)', 'aaa bbb ccc ddd')
>>> r.group(1)
'ccc ddd'
Note that in this expression, the (?: ...) parentheses are non-capturing.
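To see the difference side by side, here is a minimal sketch that runs both patterns against the string from the question. The variable names `repeated` and `wrapped` are only for illustration.

```python
import re

s = 'aaa bbb ccc ddd'

# Quantifier outside the capturing group: the group is re-captured on every
# repetition, so only the last iteration is kept.
repeated = re.search(r'\w+\s\w+\s(\w+|\s)+', s)
print(repeated.group(1))  # ddd

# Quantifier inside an outer capturing group: the whole repeated run is kept.
wrapped = re.search(r'\w+\s\w+\s((?:\w+|\s)+)', s)
print(wrapped.group(1))  # ccc ddd
```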
Upvotes: 0
Reputation: 54213
Your method doesn't work because repeating a capturing group replaces the previous capture on each iteration, so only the last one survives. If you make that a non-capturing group (including the quantifier) and wrap a capturing group around it, it should work.
import re

pattern = re.compile(r"""
    (?:\w+\s){2}    # two words we don't care about
    (               # begin capturing
      (?:\w+\s?)+   # 1+ word chars followed by an optional space, 1+ times
    )               # stop capturing""", re.X)
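As a rough usage sketch, the verbose pattern can be applied to the three sample strings from the question. The names `pattern`, `samples` and `tails` are assumptions made for this example. Note that `search` returns None for input with fewer than three words, which this sketch does not handle.

```python
import re

pattern = re.compile(r"""
    (?:\w+\s){2}    # two leading words to skip
    (               # begin capturing
      (?:\w+\s?)+   # one or more words, each optionally followed by whitespace
    )               # stop capturing
    """, re.X)

samples = ['aaa bbb ccc', 'aaa bbb ccc ddd', 'aaa bbb ccc ddd eee']

# group(1) holds everything after the first two words.
tails = [pattern.search(s).group(1) for s in samples]
print(tails)  # ['ccc', 'ccc ddd', 'ccc ddd eee']
```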
That said, I'm not sure why you're using regular expressions for this. Isn't str.split better?
s = 'aaa bbb ccc ddd'
result = s.split()[2:]
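One thing to keep in mind is that the slice gives back a list of words, not a string. If a single string is wanted, a possible sketch is to join the words again, or to limit the number of splits with `maxsplit` so the remainder is left untouched. The names `words`, `joined`, `parts` and `rest` are illustrative only.

```python
s = 'aaa bbb ccc ddd'

# Slicing the split result yields a list of the remaining words.
words = s.split()[2:]
print(words)  # ['ccc', 'ddd']

# Join them back if a single string is wanted.
joined = ' '.join(words)
print(joined)  # ccc ddd

# Or split at most twice and keep the unsplit remainder as-is.
parts = s.split(maxsplit=2)
rest = parts[2] if len(parts) > 2 else ''
print(rest)  # ccc ddd
```

The `maxsplit` variant preserves the original spacing inside the remainder, while the join variant normalizes it to single spaces.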
Upvotes: 1