Reputation: 1395
I'm learning regex for Python. I want to thank Jerry for his initial help on this problem. I tested this regex:
(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?
at http://regex101.com/ and it finds what I am looking for: the four words that come before a comma in a sentence and the four words after it. If there are only two or three words before the comma at the beginning of the sentence, it must not fail. The test sentence I am using is:
waiting for coffee, waiting for coffee and the charitable crumb.
Right now the regex returns:
[('waiting', 'for', 'coffee', '', 'waiting', 'for', 'coffee', 'and')]
I can't quite understand why the fourth member of the tuple is empty. What I want is for the regex to return only the three words before the comma and the four after it in this instance, but if there are four words before the comma, I want it to return all four. I know that regex varies between languages; is this something I am missing in Python?
Upvotes: 1
Views: 5609
Reputation: 1122022
You have optional groups:
(\b\w+\b)?
The question mark makes that an optional match. But Python will always return all groups in the pattern, and for any group that didn't match anything, an empty value (None, usually) is returned instead:
>>> import re
>>> example = 'waiting for coffee, waiting for coffee and the charitable crumb.'
>>> pattern = re.compile(r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?')
>>> pattern.search(example).groups()
('waiting', 'for', 'coffee', None, 'waiting', 'for', 'coffee', 'and')
Note the None in the output; that's the 4th word group before the comma not matching anything because there are only 3 words to match. You must've used .findall(), which explicitly returns strings, and the pattern group that didn't match is thus represented as an empty string instead.
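As a quick check of that difference, here is a minimal sketch that compares the two calls. It assumes Python 3 and uses a shortened, illustrative pattern (two optional word groups rather than the eight in the question) and a made-up input string:

```python
import re

# Shortened, illustrative pattern: up to two optional words before a comma.
pattern = re.compile(r'(\b\w+\b)?[^a-z]*(\b\w+\b)?,')
text = 'coffee, please'

# search() reports a group that did not participate in the match as None ...
search_groups = pattern.search(text).groups()
# ... while findall() substitutes an empty string for it.
findall_groups = pattern.findall(text)

print(search_groups)   # ('coffee', None)
print(findall_groups)  # [('coffee', '')]
```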
Remove the question marks, and your pattern won't match your input example until you add that required 4th word before the comma:
>>> pattern_required = re.compile(r'(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[,;]\s*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)')
>>> pattern_required.findall(example)
[]
>>> pattern_required.findall('Not ' + example)
[('Not', 'waiting', 'for', 'coffee', 'waiting', 'for', 'coffee', 'and')]
If you need to match between 2 and 4 words, but do not want empty groups, you'll have to make one group match multiple words. You cannot have a variable number of groups; regular expressions do not work like that.
Matching multiple words in one group:
>>> pattern_variable = re.compile(r'(\b\w+\b)[^a-z]*((?:\b\w+\b[^a-z]*){1,3})[,;]\s*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)')
>>> pattern_variable.findall(example)
[('waiting', 'for coffee', 'waiting', 'for', 'coffee', 'and')]
>>> pattern_variable.findall('Not ' + example)
[('Not', 'waiting for coffee', 'waiting', 'for', 'coffee', 'and')]
Here the (?:...) syntax creates a non-capturing group, one that does not produce output in the .findall() list; it is used here so we can put a quantifier on it. {1,3} tells the regular expression we want the preceding group to be matched between 1 and 3 times.
Note the output; the second group contains a variable number of words (between 1 and 3).
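If you still want the words before the comma as separate items, one option is to split that multi-word group after matching. This is only a sketch of that post-processing step; the names before and after are illustrative:

```python
import re

pattern_variable = re.compile(
    r'(\b\w+\b)[^a-z]*((?:\b\w+\b[^a-z]*){1,3})[,;]\s*'
    r'(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)[^a-z]*(\b\w+\b)'
)
example = 'waiting for coffee, waiting for coffee and the charitable crumb.'

match = pattern_variable.search(example)
# Group 1 is the first word; group 2 holds the remaining 1 to 3 words.
before = [match.group(1)] + re.findall(r'\w+', match.group(2))
# Groups 3 to 6 are the four words after the comma.
after = list(match.groups()[2:])

print(before)  # ['waiting', 'for', 'coffee']
print(after)   # ['waiting', 'for', 'coffee', 'and']
```

This keeps the pattern free of optional groups while still giving you a variable-length list for the words before the comma.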
Upvotes: 4
Reputation: 142156
Since you've got an answer on how to sort out your regex, I'd point out that in Python, stuff like this is normally much easier and more readable when done with built-in string methods, e.g.:
s = 'waiting for coffee, waiting for coffee and the charitable crumb.'
before, after = map(str.split, s.partition(',')[::2])
print(before[-4:], after[:4])
# ['waiting', 'for', 'coffee'] ['waiting', 'for', 'coffee', 'and']
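If you need this in more than one place, the same idea can be wrapped in a small helper. The sketch below is one possible variation, not part of the snippet above: the function name words_around_delimiter and its count parameter are made up for illustration, it splits on the first comma or semicolon (to mirror the [,;] in the question's pattern), and it uses re.findall(r'\w+', ...) instead of str.split so trailing punctuation is not attached to the words.

```python
import re

def words_around_delimiter(sentence, count=4):
    """Return up to `count` words before and after the first ',' or ';'."""
    parts = re.split(r'[,;]', sentence, maxsplit=1)
    if len(parts) < 2:
        # No comma or semicolon in the sentence: nothing to report.
        return [], []
    before, after = parts
    # Take the last `count` words before and the first `count` after.
    return re.findall(r'\w+', before)[-count:], re.findall(r'\w+', after)[:count]

s = 'waiting for coffee, waiting for coffee and the charitable crumb.'
print(words_around_delimiter(s))
# (['waiting', 'for', 'coffee'], ['waiting', 'for', 'coffee', 'and'])
```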
Upvotes: 2
Reputation: 21914
When you've already got a regex that long and convoluted, I strongly suggest you don't try to fix your problem by adding more regex. It will only end in tears. If you want to get rid of that empty group, I would consider just running:
filter(None, regex_return)
on each tuple you get back.
For example:
test = ('waiting', 'for', 'coffee', '', 'waiting', 'for', 'coffee', 'and')
print(tuple(filter(None, test)))
# ('waiting', 'for', 'coffee', 'waiting', 'for', 'coffee', 'and')
Which I believe does what you want.
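Since .findall() returns a list of tuples, the filter has to be applied to each tuple rather than to the list itself. A minimal sketch using the pattern and sentence from the question (the name cleaned is just illustrative):

```python
import re

pattern = re.compile(
    r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[,;]\s*'
    r'(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?[^a-z]*(\b\w+\b)?'
)
example = 'waiting for coffee, waiting for coffee and the charitable crumb.'

# Drop the empty strings that findall() inserts for unmatched groups.
cleaned = [tuple(filter(None, groups)) for groups in pattern.findall(example)]
print(cleaned)
# [('waiting', 'for', 'coffee', 'waiting', 'for', 'coffee', 'and')]
```

Note that after filtering you can no longer tell from the tuple alone which words came before the comma and which came after.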
Upvotes: 0