Jens Madsen

Reputation: 1640

Regex, greedy quantifiers multiple capture groups

I would like to capture the n words surrounding a word x, without the whitespace. I need a separate capture group for each word. I can achieve this as follows (shown here for the words after x):

import regex
n = 2
x = 'beef tomato chicken trump Madonna'
right_word = r'\s+(\S+)'
regex_right = r'^\S*{}\s*'.format(n*right_word)
m_right = regex.search(regex_right, x)
print(m_right.groups())

So if x = 'beef tomato chicken trump Madonna' and n = 2, then regex_right = '^\S*\s+(\S+)\s+(\S+)\s*', and I get two capture groups containing 'tomato' and 'chicken'. However, if n = 5 I capture nothing, which is not the behavior I was looking for. For n = 5 I want to capture all the words to the right of 'beef'.

I have tried using the greedy quantifier

regex_right = r'^\S*(\s+\S+){,n}\s*'

but I only get a single group (the last word) no matter how many repetitions match (and the group includes the whitespace as well).

Finally, I tried using regex.findall, but I cannot limit it to n words; it seems I would have to specify a number of characters instead?

Can anyone help?


Wiktor helped me (see below), thanks. However, I have an additional problem:

If x = 'beef, tomato, chicken, trump Madonna', I cannot figure out how to capture the words without the commas. I do not want groups such as 'tomato,'.
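
A minimal sketch for the comma case, not taken from the answers below. It assumes a "word" is a run of \w characters and that anything else (whitespace, commas, other punctuation) is a separator; it reuses the optional-group idea from Wiktor's answer. The helper name words_after_first_no_punct is hypothetical.

```python
import re

def words_after_first_no_punct(text, n):
    """Return up to n words following the first word, without punctuation.

    Assumption: words are runs of \\w characters; every other character
    (spaces, commas, ...) is treated as a separator.
    """
    # One optional "separator + captured word" unit per requested word.
    right_word = r'(?:\W+(\w+))?'
    pattern = r'^\w*' + n * right_word
    m = re.search(pattern, text)
    # Optional groups that did not take part in the match are None.
    return [g for g in m.groups() if g is not None]

print(words_after_first_no_punct('beef, tomato, chicken, trump Madonna', 5))
print(words_after_first_no_punct('beef, tomato, chicken, trump Madonna', 2))
```

If words may contain characters outside \w (hyphens, apostrophes), the two character classes would need to be adjusted accordingly.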

Upvotes: 2

Views: 2072

Answers (2)

Wiktor Stribiżew

Reputation: 627292

The first approach captured nothing because the pattern did not match the input string: with n = 5 it requires five words after the first one, and the input has only four. You need to make the right_word pattern optional by enclosing it in (?:...)?:

import re
x = 'beef tomato chicken trump Madonna'
n = 5
right_word = r'(?:\s+(\S+))?'
regex_right = r'^\S*{}'.format(n*right_word)
print(regex_right)
m_right = re.search(regex_right, x)
if m_right:
    print(m_right.groups())

See the Python demo.
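
As a hedged illustration of what the optional groups return: when fewer than n words follow the first one, the groups that did not participate in the match come back as None, so it may be convenient to filter them out. The helper name words_after_first is hypothetical and only wraps the pattern shown above.

```python
import re

def words_after_first(text, n):
    """Return up to n whitespace-separated words following the first word."""
    # n optional "whitespace + captured word" units, as in the answer above.
    right_word = r'(?:\s+(\S+))?'
    pattern = r'^\S*' + n * right_word
    m = re.search(pattern, text)
    # Drop the None entries left by optional groups that did not match.
    return [g for g in m.groups() if g is not None]

x = 'beef tomato chicken trump Madonna'
print(words_after_first(x, 5))  # four words; the fifth group was None
print(words_after_first(x, 2))
```

Because every part of the pattern can match the empty string, re.search always returns a match object here, so no None check on m is needed.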

The second approach will only work with the PyPI regex module, because Python's re does not keep repeated captures: once a quantified capturing group matches a substring again within the same match, its value is overwritten.

>>> import regex
>>> n = 5
>>> regex_right = r'^\S*(?:\s+(\S+)){{1,{0}}}'.format(n)
>>> result = [x.captures(1) for x in regex.finditer(regex_right, "beef tomato chicken trump Madonna")]
>>> result
[['tomato', 'chicken', 'trump', 'Madonna']]
>>> print(regex_right)
^\S*(?:\s+(\S+)){1,5}

Note that ^\S*(?:\s+(\S+)){1,5} has capturing group #1 inside a non-capturing group that is quantified with the {1,5} limiting quantifier. Since PyPI regex keeps track of all values captured by repeated capturing groups, they are all accessible via .captures(1) here. You can test this feature with a .NET regex tester.

Upvotes: 6

Gawil

Reputation: 1211

You have the right approach. However, regex can't do what you're asking for. Each time your capturing group matches another substring, its previous content is replaced. That is why your capturing group only returns the last substring captured.
You can easily match n words, but you can't capture them separately without writing each capture group explicitly.
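
A small sketch of the overwrite behaviour with the standard re module, followed by one possible workaround that is not part of this answer: collect all words with re.findall and slice the list to n entries. It assumes words are separated by whitespace only.

```python
import re

x = 'beef tomato chicken trump Madonna'
n = 3

# A repeated capturing group in re keeps only its last repetition.
m = re.search(r'^\S*(?:\s+(\S+)){1,%d}' % n, x)
print(m.group(1))  # the third word after 'beef', earlier captures are lost

# Workaround: find every word, skip the first one, keep at most n.
words = re.findall(r'\S+', x)[1:n + 1]
print(words)
```

Slicing never raises when fewer than n words are available, so the same line also covers the case where n exceeds the number of remaining words.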

Upvotes: 0
