Reputation: 27
My string is a transcript. I want to capture each speaker, specifically their surname (which should only match when fully capitalised), along with their speech up to the point where the next speaker begins. Eventually I want to loop this process over a huge text file.
The problem is that only one match object is returned, even though there are two different speakers. I have also tried an online regex tester with the Python flavor, but it returns very different results (not sure why).
import re

# Named `text` rather than `str` so the built-in type is not shadowed.
text = 'Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) '
pattern = re.compile(r"(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator)")
for match in pattern.finditer(text):
    print(match)  # only one match object (BACK) is printed
I want two match objects, each with a group for the speaker's surname and a group for their speech. Note that the online regex debuggers, even when set to the Python flavor, give different results from Python in my terminal.
Upvotes: 1
Views: 775
Reputation: 12438
Just change the regex to:
(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator|$)
demo: https://regex101.com/r/gJDaWM/1/
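With your current regex, the positive lookahead enforces the condition that each match must be followed by Senator. The last speaker in the string is followed by nothing, so that match is dropped; adding |$ to the lookahead also accepts the end of the string.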
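To see the effect of the lookahead, here is a minimal sketch that runs the question's pattern unchanged (only written as a raw string) against the sample transcript. The variable names `text`, `original` and `surnames` are illustrative, not from the question.

```python
import re

# Sample transcript from the question; the speech text is elided there with "(...)".
text = ('Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) '
        'Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) ')

# The question's pattern: every match must be followed by "Senator".
original = re.compile(
    r"(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator)"
)

# Group 3 is the capitalised surname.
surnames = [m.group(3) for m in original.finditer(text)]
print(surnames)  # ['BACK'] -- DAY is not followed by "Senator", so it never matches
```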
You might actually have to change the positive lookahead to:
(?=Senator|Mr|Dr|$)
if you want to take Mr and Dr into account on top of Senator.
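As a rough sketch of how the adjusted lookahead behaves in Python, the snippet below applies it to the sample string from the question and pulls out the surname and speech for each speaker. The names `text`, `fixed` and `speeches` are placeholders, and the group numbers assume the pattern is kept exactly as written above.

```python
import re

# Sample transcript from the question; the speech text is elided there with "(...)".
text = ('Senator BACK\n (Western Australia) (21:15): This evening I had the pleasure (...) '
        'Senator DAY\n (South Australia) (21:34): Well, what a week it h(...) ')

# Same pattern, but the lookahead now also accepts Mr, Dr, or the end of the string.
fixed = re.compile(
    r"(:?(Senator|Mr|Dr)\s+([A-Z]{2,})\s*(\(.+?\))\s+(\(\d{2}:\d{2}\):)(.*))(?=Senator|Mr|Dr|$)"
)

# Group 3 is the surname, group 6 is the speech; strip() drops the padding spaces.
speeches = [(m.group(3), m.group(6).strip()) for m in fixed.finditer(text)]
for surname, speech in speeches:
    print(surname, '->', speech)
```

Two caveats, both assumptions about the real file rather than anything shown in the question: `.` does not match newlines by default, so a speech that spans several lines would probably need `re.DOTALL` together with a non-greedy `(.*?)`; and a bare `Dr` in the lookahead can also match at the start of an ordinary capitalised word such as "Drought", so a stricter form like `Dr\s+[A-Z]{2,}` may be safer.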
Upvotes: 1