Reputation: 3046
I have a string like this,
my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'
Now, I want to extract the current champion and the underdog using the keywords champion and underdog.
What is really challenging here is that both contenders' names appear before the keyword, which sits inside parentheses. I want to extract this information with a regular expression.
Here is what I tried:
import re
champion = re.findall(r'("champion"[^.]*.)', my_str)
print(champion)
>> ['"champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").']
underdog = re.findall(r'("underdog"[^.]*.)', my_str)
print(underdog)
>>['"underdog").']
However, I need the champion as:
brooklyn centenniel, resident of detroit, michigan
and the underdog as:
kamil kubaru, the challenger from alexandria, virginia
How can I do this using a regular expression? (I have been searching for a way to go back a couple of words from the keyword to get the result I want, but no luck yet.) Any help or suggestions would be appreciated.
Upvotes: 2
Views: 1261
Reputation: 680
There will be a better answer than this, and I don't know regex at all, but I'm bored, so here's my 2 cents.
Here's how I would go about it:
words = my_str.split()
index = words.index('("champion")')
champion = words[index - 6:index]
champion = " ".join(champion)
For the underdog, you will have to change the 6 to a 7, and '("champion")' to '("underdog").'
Not sure if this will solve your problem, but for this particular string, this worked when I tested it.
You could also use str.strip() to remove punctuation if that trailing period on underdog is a problem.
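The steps above can be wrapped in a small helper so the same code covers both names. This is only a sketch: the function name words_before and its count parameter are made up for illustration, and the word counts (6 and 7) are still hard-coded for this particular string, so it will not generalize to names of other lengths.

```python
my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") and '
          'kamil kubaru, the challenger from alexandria, virginia ("underdog").')

def words_before(text, marker, count):
    """Return the `count` whitespace-separated words that precede `marker`.

    Hypothetical helper: `count` must be known in advance for each marker.
    """
    words = text.split()
    # Compare against tokens with trailing punctuation removed, so that
    # '("underdog").' is found when searching for '("underdog")'.
    stripped = [w.rstrip('.,;') for w in words]
    index = stripped.index(marker)  # raises ValueError if the marker is absent
    return " ".join(words[index - count:index])

champion = words_before(my_str, '("champion")', 6)
underdog = words_before(my_str, '("underdog")', 7)
print(champion)
print(underdog)
```

Stripping the punctuation before the lookup means the same marker format works for both keywords, with no special case for the trailing period.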
Upvotes: 0
Reputation: 42137
You can use named capture groups to capture the desired results:
between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)
between\s+(?P<champion>.*?)\s+\("champion"\) matches the chunk from between to ("champion") and puts the portion in between into the named capture group champion.
After that, \s+and\s+(?P<underdog>.*?)\s+\("underdog"\) matches the chunk up to ("underdog") and again puts the desired portion into the named capture group underdog.
Example:
In [26]: my_str ='·in this match, dated may 1, 2013 (the "the match") is between brooklyn centenniel, resident of detroit, michigan ("champion") and kamil kubaru, the challenger from alexandria, virginia ("underdog").'
In [27]: out = re.search(r'between\s+(?P<champion>.*?)\s+\("champion"\)\s+and\s+(?P<underdog>.*?)\s+\("underdog"\)', my_str)
In [28]: out.groupdict()
Out[28]:
{'champion': 'brooklyn centenniel, resident of detroit, michigan',
'underdog': 'kamil kubaru, the challenger from alexandria, virginia'}
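If the labels are not known ahead of time, a variation on the same idea is to capture the label itself and build the dictionary with re.finditer. This is a sketch, not the pattern above: it assumes every party is introduced by "between" or "and" and is followed by a one-word quoted label in parentheses, which holds for this string but may not hold for others.

```python
import re

my_str = ('·in this match, dated may 1, 2013 (the "the match") is between '
          'brooklyn centenniel, resident of detroit, michigan ("champion") and '
          'kamil kubaru, the challenger from alexandria, virginia ("underdog").')

# Assumed shape: (between|and) <text> ("<label>")
# The lazy .*? stops at the first parenthesized, quoted, single-word label.
pattern = re.compile(r'\b(?:between|and)\s+(?P<text>.*?)\s+\("(?P<label>\w+)"\)')

# Map each label to the text that precedes it.
parties = {m.group('label'): m.group('text') for m in pattern.finditer(my_str)}
print(parties)
```

Because \w+ does not match spaces, the earlier (the "the match") definition is not picked up as a label.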
Upvotes: 1