Reputation: 91
I am working with transcripts and having trouble matching patterns in a non-greedy fashion. The pattern still grabs far too much and appears to be matching greedily.
A transcript looks like this:
>> John doe: Hello, I am John Doe.
>> Hello, I am Jane Doe.
>> Thank you for coming, we will start in two minutes.
>> Sam Smith: [no audio] Good morning, everyone.
To find the names of the speakers in >> (WHATEVER NAME):, I wrote
pattern=re.compile(r'>>(.*?):')
transcript='>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)
I expected 'John doe' and 'Sam Smith', but it is giving me 'John doe' and 'Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'.
I am confused because .*? is non-greedy, which (I think) should be able to grab 'Sam Smith'. How should I fix the code so that it only grabs whatever is in >> (WHATEVER NAME):? Also, I am using Python 3.6.
Thanks!
Upvotes: 3
Views: 242
Reputation: 51155
Your understanding of non-greedy matching is slightly off. A non-greedy quantifier matches as few characters as possible from the position where the match begins. It does not move that starting position forward when a later candidate start appears inside the match.
For example:
start.*?stop
This matches all of startstartstop, because once it starts matching at the first start it keeps matching until it finds stop. Non-greedy simply means that for the string startstartstopstop, it matches only up to the first stop.
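The difference can be checked directly. The snippet below is a minimal sketch that runs the lazy pattern from the example next to its greedy counterpart on the same string; the variable names are only for illustration.

```python
import re

text = 'startstartstopstop'

# Lazy quantifier: begins at the first 'start' and stops at the first 'stop'.
lazy_match = re.search(r'start.*?stop', text).group()

# Greedy quantifier: begins at the same place but runs to the last 'stop'.
greedy_match = re.search(r'start.*stop', text).group()

print(lazy_match)    # startstartstop
print(greedy_match)  # startstartstopstop
```

Both matches begin at index 0. Laziness only changes where the match ends, never where it starts.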
For your question, this is an easy problem to solve using a positive lookahead. You may use >> ([a-zA-Z ]+)(?=:):
>>> import re
>>> transcript='>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
>>> re.findall(r'>> ([a-zA-Z ]+)(?=:)', transcript)
['John doe', 'Sam Smith']
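Note that it is the restricted character class that keeps the match from running across a later >>, since . can match > but [a-zA-Z ] cannot. If names may contain hyphens, apostrophes, or digits, a broader class may be needed. The following is a sketch of one such variant, not the pattern from this answer: SPEAKER and speaker_names are hypothetical names, and it assumes a speaker name never contains ':' or '>'.

```python
import re

transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'

# Pattern from the question: lazy, but '.' is still free to cross a later '>>'.
original = re.findall(r'>>(.*?):', transcript)

# Hypothetical variant: a name is any run of characters other than ':' and '>'.
SPEAKER = re.compile(r'>>\s*([^:>]+):')

def speaker_names(text):
    """Return the speaker names found between '>>' and ':' in text."""
    return [name.strip() for name in SPEAKER.findall(text)]

names = speaker_names(transcript)
print(original[1])  # the over-long second match from the question
print(names)        # ['John doe', 'Sam Smith']
```

Like the lookahead version, this still treats any colon that follows >> without an intervening > as the end of a name, so an utterance that opens with something like "Note:" would be reported as a speaker.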
Upvotes: 2
Reputation: 402563
Do you really need regex? You can split on the >> prompts and then filter out your names.
>>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
['John doe', 'Sam Smith']
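If the one-liner is hard to read, the same idea can be spelled out as a loop. This is only a sketch: speakers_by_split is a hypothetical helper name, and it uses str.partition in place of split(':')[0], which should behave the same for this purpose.

```python
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'

def speakers_by_split(text):
    """Collect the text before the first ':' in every '>>' turn that has one."""
    names = []
    for turn in text.split('>>'):
        head, sep, _rest = turn.partition(':')
        if sep:  # sep is '' when the turn contains no colon
            names.append(head.strip())
    return names

result = speakers_by_split(transcript)
print(result)  # ['John doe', 'Sam Smith']
```

As with the one-liner, a turn without a speaker label is skipped only when it contains no colon at all, so speech that itself contains a colon would be picked up as a name.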
Upvotes: 4