Reputation: 4865
I am trying to create a regular expression that will take strings and break them up into three groups: (1) any one of a specific list of words at the beginning of a string, (2) any one of a specific list of words at the end of a string, and (3) all of the letters/whitespace in between these two matches.
As an example, I will use the following two strings:
'There was a cat in the house yesterday'
'Did you see a cat in the house today'
I would like the string to be broken up into capture groups so that the match object's m.groups() will return the following for each string respectively:
('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')
Originally, I came up with the following regex:
r = re.compile('^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')
However, this returns:
('There', 'e', 'yesterday')
('Did', 'e', 'today')
So it's only giving me the last character matched in the middle group. I learned that this doesn't work because capture groups will only return the last iteration that matched. So I put parentheses around the middle capture group as follows:
r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
But now, although it does at least capture the middle group, it is also returning an extra "e" character in m.groups(), i.e.:
('There', 'was a cat in the house', 'e', 'yesterday')
Although I feel like this has something to do with backtracking, I can't figure out why it is happening. Could someone please explain to me why I am getting this result, and how I can get the desired results?
Upvotes: 3
Views: 1727
Reputation: 5515
r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
The inner parentheses and the " |" alternation inside the middle group are unnecessary. Take those out and include a space in the character class of your middle group:
r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
Note the space inside the character class [A-Za-z ].
EXAMPLE:
>>> r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
>>> r.search('There was a cat in the house yesterday').groups()
('There', 'was a cat in the house', 'yesterday')
Also, remove the literal spaces between the capture groups if you want the leading and trailing spaces to be part of your middle (2nd) group.
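A minimal sketch of that last suggestion, using the two strings from the question. The variable name pattern is only for illustration; the literal spaces between the groups are gone, so the surrounding spaces end up inside group 2:

```python
import re

# No literal spaces between the groups: the spaces around the middle text
# are now matched (and captured) by the character class in group 2.
pattern = re.compile(r'^(There|Did)([A-Za-z ]+)(today|yesterday)$')

for s in ('There was a cat in the house yesterday',
          'Did you see a cat in the house today'):
    print(pattern.search(s).groups())

# ('There', ' was a cat in the house ', 'yesterday')
# ('Did', ' you see a cat in the house ', 'today')
```

This should give the tuples the question asks for, including the leading and trailing spaces in the middle group.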
Upvotes: 1
Reputation: 12587
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
source https://regex101.com/
And here is the regex working as expected:
^(There|Did) ([ A-Za-z]+) (today|yesterday)$
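To illustrate where the extra "e" comes from: every pair of plain parentheses is a numbered group, so the inner repeated group in the question's second attempt shows up as its own entry in groups(), holding only its last iteration. The sketch below compares that attempt with a variant that marks the inner group as non-capturing with (?:...). The variable names nested and non_capturing are only for illustration:

```python
import re

s = 'There was a cat in the house yesterday'

# Inner group is capturing: it becomes group 3 and keeps only the last
# character it matched (the final "e" of "house").
nested = re.compile(r'^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
print(nested.search(s).groups())
# ('There', 'was a cat in the house', 'e', 'yesterday')

# Inner group is non-capturing, so it no longer appears in groups().
non_capturing = re.compile(r'^(There|Did) ((?: |[A-Za-z])+) (today|yesterday)$')
print(non_capturing.search(s).groups())
# ('There', 'was a cat in the house', 'yesterday')
```

So the extra entry is a consequence of group numbering rather than of backtracking.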
Upvotes: 1
Reputation: 9633
You can simplify your current regex, and get the correct behavior, by replacing your middle capture group with the . (dot) operator, which matches any character (except a newline, by default), followed by the * (asterisk) operator to match it repeatedly:
import re
s1 = 'There was a cat in the house yesterday'
s2 = 'Did you see a cat in the house today'
x = re.compile("(There|Did)(.*)(today|yesterday)")
g1 = x.search(s1).groups()
g2 = x.search(s2).groups()
print(g1)
print(g2)
Produces this output:
('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')
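One thing to keep in mind: the pattern above has no ^ or $ anchors, so search() will also accept the keywords in the middle of a string. If the words must sit at the very beginning and end, as the question describes, a sketch with anchors added could look like this (the names unanchored and anchored and the third test string are only for illustration):

```python
import re

unanchored = re.compile(r'(There|Did)(.*)(today|yesterday)')
anchored = re.compile(r'^(There|Did)(.*)(today|yesterday)$')

s = 'Well, Did you see it today or not'

# Without anchors, the keywords are found in the middle of the string.
print(unanchored.search(s).groups())
# ('Did', ' you see it ', 'today')

# With anchors, the same string is rejected.
print(anchored.search(s))
# None

# The question's strings still match as before.
print(anchored.search('There was a cat in the house yesterday').groups())
# ('There', ' was a cat in the house ', 'yesterday')
```

Note also that .* accepts digits and punctuation, not just letters and spaces, which may or may not be what you want.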
Upvotes: 1