Reputation: 237
I'm attempting to do this:
p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
test_str = u"Russ Middleton and Lisa Murro\nRon Iervolino, Trish and Russ Middleton, and Lisa Middleton \nRon Iervolino, Kelly and Tom Murro\nRon Iervolino, Trish and Russ Middleton and Lisa Middleton "
subst = u"$1$2 $3"
result = re.sub(p, subst, test_str)
The goal is to get something that both matches all the names and fills in last names when necessary (e.g., Trish and Russ Middleton becomes Trish Middleton and Russ Middleton). In the end, I'm looking for the names that appear together in a single line.
Someone else was kind enough to help me with the regex, and I thought I knew how to write it programmatically in Python (although I'm new to Python). Not being able to get it, I resorted to using the code generated by Regex101 (the code shown above). However, all I get in result is:
u'$1$2 $3 and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3, and $1$2 $3 \n$1$2 $3, $1$2 $3 and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3 and $1$2 $3 '
What am I missing with Python and regular expressions?
Upvotes: 0
Views: 132
Reputation: 237
Alex: I see what you're saying about the groups. That didn't occur to me. Thanks!
I took a fresh (ish) approach. This appears to be working. Any thoughts on it?
p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
temp_result = p.findall(s)
joiner = " ".join
out = [joiner(words).strip() for words in temp_result]
Here is some input data:
test_data = ['John Smith, Barri Lieberman, Nancy Drew','Carter Bays and Craig Thomas','John Smith and Carter Bays',
'Jena Silverman, John Silverman, Tess Silverman, and Dara Silverman', 'Tess and Dara Silverman',
'Nancy Drew, John Smith, and Daniel Murphy', 'Jonny Podell']
I put the code above in a function so I could call it on every item in the list. Calling it on each item of the list above, I get this output:
['John Smith', 'Barri Lieberman', 'Nancy Drew']
['Carter Bays', 'Craig Thomas']
['John Smith', 'Carter Bays']
['Jena Silverman', 'John Silverman', 'Tess Silverman', 'Dara Silverman']
['Tess Silverman', 'Dara Silverman']
['Nancy Drew', 'John Smith', 'Daniel Murphy']
['Jonny Podell']
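For reference, here is a minimal sketch of what that wrapper function might look like. The name extract_names is hypothetical (the post does not show the actual function), and the pattern is written as a plain raw string so it also runs on Python 3, where the ur'' prefix is not accepted.

```python
import re

# Same pattern as above, without the Python 2 only ur'' prefix.
NAME_RE = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'
)

def extract_names(line):
    """Return the full names found in one line (hypothetical helper)."""
    # findall() yields one tuple per match, with '' for groups that did
    # not participate, so joining and stripping leaves just the name.
    return [' '.join(groups).strip() for groups in NAME_RE.findall(line)]

test_data = ['Tess and Dara Silverman', 'Jonny Podell']
for item in test_data:
    print(extract_names(item))
```

Because findall() returns empty strings rather than None for unmatched groups, the join never fails; the strip() only removes the padding left by the empty groups.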
Upvotes: 0
Reputation: 174756
I suggest a simple solution.
import re
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton
Ron Iervolino, Kelly and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton """
m = re.sub(r'(?<=,\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)
Output:
Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton
Ron Iervolino, Kelly Murro and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton
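Or, to also handle a first name that is not preceded by a comma (such as the last input line below), use the third-party regex module, whose lookbehind may be variable-width: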
import regex
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton
Ron Iervolino, Kelly and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton
Trish and Russ Middleton"""
m = regex.sub(r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)
Output:
Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton
Ron Iervolino, Kelly Murro and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton
Trish Middleton and Russ Middleton
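The second pattern relies on the regex module because its lookbehind contains \w+, which has no fixed width. As a quick check of that assumption, this small sketch tries to compile the same pattern with the standard re module and records whether it is accepted; the variable names are just for illustration.

```python
import re

# The lookbehind (?<!\b[A-Z]\w+\s) can match text of varying length,
# which the standard library's re module does not allow.
pattern = r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'

try:
    re.compile(pattern)
    stdlib_accepts = True
except re.error as exc:
    stdlib_accepts = False
    print("re rejected the pattern:", exc)
```

If installing a third-party package is not an option, the first pattern with the fixed-width (?<=,\s) lookbehind stays within what re supports, at the cost of missing names at the start of a line.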
Upvotes: 0
Reputation: 882023
You're not using the right syntax for subst. Python's re module writes group references as \1 (or \g<1>), not $1, so try this instead:
subst = r'\1\2 \3'
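To see the difference in isolation, here is a small sketch with a simpler, made-up pattern in which both groups always match. It compares the $1 style with the two backreference spellings that re understands.

```python
import re

# Illustrative pattern: two words separated by a single space.
swap = re.compile(r'(\w+) (\w+)')

dollar = swap.sub('$2 $1', 'Lisa Murro')          # '$' has no special meaning
backslash = swap.sub(r'\2 \1', 'Lisa Murro')      # numeric backreferences
explicit = swap.sub(r'\g<2> \g<1>', 'Lisa Murro') # unambiguous \g<n> form

print(dollar)     # $2 $1
print(backslash)  # Murro Lisa
print(explicit)   # Murro Lisa
```

The \g<n> form is handy when a digit follows the reference, since \10 would otherwise be read as group 10.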
However, this exposes a second problem: not every match has all three groups set.
Specifically:
>>> for x in p.finditer(test_str): print(x.groups())
...
('Russ Middleton', None, None)
('Lisa Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)
('Ron Iervolino', None, None)
(None, 'Kelly', 'Murro')
('Tom Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)
Whenever you see a None here, trying to interpolate the corresponding group (\1, etc.) in a substitution string is an error in Python 2 and in Python 3 before 3.5 (later versions substitute an empty string instead).
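How a template handles a group that did not match depends on the interpreter version: since Python 3.5 the re documentation states that unmatched groups are replaced with an empty string, while older versions raise an error. The sketch below, on a single sample line, is written to cope with either behaviour.

```python
import re

p = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'
)
line = "Kelly and Tom Murro"

try:
    # Python 3.5+: groups that did not match become ''.
    filled = p.sub(r'\1\2 \3', line)
except re.error:
    # Older interpreters refuse to expand an unmatched group.
    filled = None

print(repr(filled))
```

On Python 3.5 or later this prints 'Kelly Murro and Tom Murro ' (note the trailing space contributed by the literal space in the template when \3 is empty).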
A function can be more flexible:
>>> def mysub(mo):
...     return '{}{} {}'.format(
...         mo.group(1) or '',
...         mo.group(2) or '',
...         mo.group(3) or '')
...
>>> result = re.sub(p, mysub, test_str)
>>> result
'Russ Middleton and Lisa Murro \nRon Iervolino , Trish Middleton and Russ Middleton , and Lisa Middleton \nRon Iervolino , Kelly Murro and Tom Murro \nRon Iervolino , Trish Middleton and Russ Middleton and Lisa Middleton '
Here, I've coded mysub to do what I suspect you thought a substitution string with group numbers would do for you: use an empty string where a group did not match (i.e., where the corresponding mo.group(...) is None).
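One possible refinement, offered only as a sketch: joining just the groups that actually matched avoids the stray space that the '{}{} {}' template leaves after every full name. The function name fill_last_name is made up for this example.

```python
import re

p = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))'
)

def fill_last_name(mo):
    # Keep only the groups that participated in this match, so a full
    # name comes back unchanged and a lone first name gains a surname.
    return ' '.join(g for g in mo.groups() if g)

text = "Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton"
result = p.sub(fill_last_name, text)
print(result)
# Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton
```

Passing a function to sub() works the same way regardless of how the interpreter treats unmatched groups in template strings.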
Upvotes: 1