Regex, greedy quantifiers multiple capture groups

Question

I would like to capture the n words surrounding a word x, without the whitespace. I need a capture group for each word. I can achieve this as follows (shown here for the words after x):

import regex
n = 2
x = 'beef tomato chicken trump Madonna'
right_word = r'\s+(\S+)'
regex_right = r'^\S*{}\s*'.format(n*right_word)
m_right = regex.search(regex_right, x)
print(m_right.groups())

So if x = 'beef tomato chicken trump Madonna' and n = 2, then regex_right = '^\S*\s+(\S+)\s+(\S+)\s*', and I get two capture groups containing 'tomato' and 'chicken'. However, if n = 5 I capture nothing, which is not the behavior I was looking for. For n = 5 I want to capture all the words to the right of 'beef'.

I have tried using the greedy quantifier

regex_right = r'^\S*(\s+\S+){,n}\s*'

but I only get a single group (the last word) no matter how many repetitions match (and I get the whitespace as well).

Finally, I tried using regex.findall, but I cannot limit it to n words; it seems I have to specify a number of characters instead?

Can anyone help?

Wiktor helped me (see below), thanks. However, I have an additional problem:

If x = 'beef, tomato, chicken, trump Madonna', I cannot figure out how to capture the words without the commas. I do not want groups such as 'tomato,'.

Wiktor Stribiżew · Accepted Answer

You did not match all those words with the first approach because the pattern did not match the input string. You need to make the right_word pattern optional by enclosing it with (?:...)?:

import re
x = 'beef tomato chicken trump Madonna'
n = 5
right_word = r'(?:\s+(\S+))?'
regex_right = r'^\S*{}'.format(n*right_word)
print(regex_right)
m_right = re.search(regex_right, x)
if m_right:
    print(m_right.groups())

See the Python demo.
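With this pattern, any optional group that does not take part in the match comes back as None, so for n = 5 the tuple ends with a None entry. A minimal sketch of filtering those out follows; the helper name words_after_first is made up for illustration and is not part of the original answer.

```python
import re

def words_after_first(text, n):
    """Return up to n whitespace-separated words that follow the first word."""
    # Raw string so that \s and \S reach the regex engine unchanged.
    right_word = r'(?:\s+(\S+))?'
    pattern = r'^\S*' + n * right_word
    # Every part of the pattern is optional, so search() always finds a match.
    m = re.search(pattern, text)
    # Groups that did not participate in the match are None; drop them.
    return [g for g in m.groups() if g is not None]

print(words_after_first('beef tomato chicken trump Madonna', 2))
# ['tomato', 'chicken']
print(words_after_first('beef tomato chicken trump Madonna', 5))
# ['tomato', 'chicken', 'trump', 'Madonna']
```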

The second approach will only work with the PyPI regex module, because Python's re does not keep repeated captures: once a quantified capturing group matches a substring again within the same match, its previous value is overwritten.

>>> import regex  # PyPI module, not the standard-library re
>>> n = 5
>>> regex_right = r'^\S*(?:\s+(\S+)){{1,{0}}}'.format(n)
>>> result = [x.captures(1) for x in regex.finditer(regex_right, "beef tomato chicken trump Madonna")]
>>> result
[['tomato', 'chicken', 'trump', 'Madonna']]
>>> print(regex_right)
^\S*(?:\s+(\S+)){1,5}
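For comparison, here is a small sketch of what the same pattern does under the standard-library re module, plus one possible workaround based on findall and slicing. The workaround is an assumption on my part rather than something from the answer, and it presumes the string starts with the word of interest (no leading whitespace).

```python
import re

text = 'beef tomato chicken trump Madonna'
n = 5

# With re, a repeated capturing group keeps only the value of its last iteration.
m = re.search(r'^\S*(?:\s+(\S+)){1,%d}' % n, text)
last_only = m.group(1)

# Workaround sketch: collect every word, skip the first one, keep at most n.
words_after = re.findall(r'\S+', text)[1:n + 1]

print(last_only)
# Madonna
print(words_after)
# ['tomato', 'chicken', 'trump', 'Madonna']
```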

Note that ^\S*(?:\s+(\S+)){1,5} has capturing group #1 inside a non-capturing group that is quantified with the {1,5} limiting quantifier. Since the PyPI regex module keeps track of all values captured by repeated capturing groups, they are all accessible via .captures(1) here. You can also test this feature with a .NET regex tester, since .NET likewise keeps every capture of a repeated group.
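Regarding the follow-up about commas in the question: the answer above does not cover it, but one possible approach is to treat commas like whitespace, that is, to use [\s,]+ as the separator and [^\s,]+ as the word. The sketch below assumes commas only ever act as separators and never belong inside a word; the helper name words_after_first_no_commas is hypothetical.

```python
import re

def words_after_first_no_commas(text, n):
    """Return up to n words after the first one, treating commas as separators."""
    # [\s,]+ consumes whitespace and commas between words;
    # [^\s,]+ captures a run of characters that is neither of those.
    right_word = r'(?:[\s,]+([^\s,]+))?'
    pattern = r'^[^\s,]*' + n * right_word
    m = re.search(pattern, text)
    # Optional groups that did not match are None; drop them.
    return [g for g in m.groups() if g is not None]

print(words_after_first_no_commas('beef, tomato, chicken, trump Madonna', 5))
# ['tomato', 'chicken', 'trump', 'Madonna']
```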
