user4093955


Regex for matching the third, fourth, fifth... word

I have some strings like "aaa bbb ccc", "aaa bbb ccc ddd", "aaa bbb ccc ddd eee", and so on.

I need a regex so that I can get rid of aaa bbb and get everything else.

I'm trying '\w+\s\w+\s(\w+|\s)+' but it's not working.

In [171]: r = re.search('\w+\s\w+\s(\w+|\s)+', 'aaa bbb ccc ddd')

In [172]: r.group(0)
Out[172]: 'aaa bbb ccc ddd'

In [173]: r.group(1)
Out[173]: 'ddd'

I'd expect r.group(1) to return 'ccc ddd'.

Upvotes: 1

Views: 84

Answers (2)

donkopotamus

Reputation: 23186

The issue here is that your capturing group does not encompass all the repeats of \w+|\s, because your + is outside of the parentheses. A repeated capturing group only keeps the text from its last repetition, which is why you get 'ddd'.

Instead, try:

>>> import re
>>> r = re.search(r'\w+\s\w+\s((?:\w+|\s)+)', 'aaa bbb ccc ddd')
>>> r.group(1)
'ccc ddd'

Note that in this expression, (?: ...) is a non-capturing group, so only the outer parentheses capture anything.
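To make the difference concrete, here is a small side-by-side sketch of the two patterns on the string from the question. The variable names (`repeated`, `wrapped`, `last_only`, `everything`) are just illustrative and not from the original post.

```python
import re

text = 'aaa bbb ccc ddd'

# Quantifier outside the capturing group: every repetition overwrites
# group 1, so only the final repetition is kept.
repeated = re.search(r'\w+\s\w+\s(\w+|\s)+', text)

# Quantifier inside the capturing group, applied to a non-capturing group:
# group 1 now spans all repetitions.
wrapped = re.search(r'\w+\s\w+\s((?:\w+|\s)+)', text)

last_only = repeated.group(1)    # 'ddd'
everything = wrapped.group(1)    # 'ccc ddd'
```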

Upvotes: 0

Adam Smith

Reputation: 54213

Your method doesn't work because repeating capturing groups replaces the previous capture. If you make that a non-capturing group (including the quantifier) and wrap a capturing group around it, it should work.

re.compile(r"""
    (?:\w+\s){2}        # two words we don't care about
    (                   # begin capturing
      (?:\w+\s?)+       #   1+ word chars followed by an optional space, 1+ times
    )                   # stop capturing""", re.X)
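For what it's worth, one way the compiled pattern might be used is sketched below. The helper name `rest_after_two_words` is hypothetical, and using `match` (anchored at the start of the string) rather than `search` is my assumption about what is wanted here.

```python
import re

pattern = re.compile(r"""
    (?:\w+\s){2}        # two words we don't care about
    (                   # begin capturing
      (?:\w+\s?)+       #   1+ word chars followed by an optional space, 1+ times
    )                   # stop capturing""", re.X)


def rest_after_two_words(text):
    """Return everything after the first two words, or None if there is no third word."""
    m = pattern.match(text)
    return m.group(1) if m else None


third_onward = rest_after_two_words('aaa bbb ccc ddd eee')   # 'ccc ddd eee'
too_short = rest_after_two_words('aaa bbb')                  # None
```

Note that a trailing space in the input would end up in the captured group, since the pattern allows an optional space after each word.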

That said, I'm not sure why you're using regular expressions for this. Isn't str.split better?

s = 'aaa bbb ccc ddd'
result = s.split()[2:]
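Keep in mind that the slice gives you a list of words rather than a single string. If you want the remainder as one string, something along these lines should work; the names `joined`, `parts`, and `tail` are only for illustration.

```python
s = 'aaa bbb ccc ddd'

# Slice off the first two words, then glue the rest back together.
words = s.split()[2:]        # ['ccc', 'ddd']
joined = ' '.join(words)     # 'ccc ddd'

# Alternatively, split at most twice so the remainder stays in one piece.
parts = s.split(None, 2)     # ['aaa', 'bbb', 'ccc ddd']
tail = parts[2] if len(parts) > 2 else ''
```

The second form keeps any internal spacing of the remainder exactly as it was, whereas joining normalises it to single spaces.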

Upvotes: 1
