ybcha204

Reputation: 91

Python regex non-greedy acting like greedy

I am working with transcripts and having trouble matching patterns non-greedily. My pattern still grabs far too much, as if it were matching greedily.

A transcript looks like this:

>> John doe: Hello, I am John Doe.

>> Hello, I am Jane Doe.

>> Thank you for coming, we will start in two minutes.

>> Sam Smith: [no audio] Good morning, everyone.

To find the speaker names inside >> (WHATEVER NAME):, I wrote:

import re

pattern = re.compile(r'>>(.*?):')
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)

I expected 'John doe' and 'Sam Smith', but it gives me 'John doe' and 'Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'.

I am confused because .*? is non-greedy, which (I think) should be able to grab just 'Sam Smith'. How should I fix the code so that it only grabs whatever is inside >> (WHATEVER NAME):? I am using Python 3.6.

Thanks!

Upvotes: 3

Views: 242

Answers (2)

user3483203

Reputation: 51155

Your understanding of a non-greedy regex is slightly off. Non-greedy means the regex takes the shortest match possible from the position where it begins matching. It does not move that starting position forward just because a later starting point would give a shorter match.

For example:

start.*?stop

will match all of startstartstop, because once it begins matching at the first start it keeps going until it finds a stop. Non-greedy simply means that for the string startstartstopstop, it matches only up to the first stop.
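A minimal sketch of that difference, using the same example string. The variable names lazy and greedy are only illustrative:

```python
import re

text = 'startstartstopstop'

# Non-greedy: the match still begins at the first "start",
# and ends at the first "stop" that follows.
lazy = re.search(r'start.*?stop', text).group()

# Greedy: same starting position, but it extends to the last "stop".
greedy = re.search(r'start.*stop', text).group()

print(lazy)    # startstartstop
print(greedy)  # startstartstopstop
```

Both matches start at index 0. The ? only changes where the match ends.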

For your question, this is easy to solve by restricting which characters a name may contain, with a positive lookahead so that the colon is checked but not consumed.

You may use >> ([a-zA-Z ]+)(?=:):

>>> transcript='>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'    
>>> re.findall(r'>> ([a-zA-Z ]+)(?=:)', transcript)
['John doe', 'Sam Smith']
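To see why restricting the characters helps, here is a small comparison on the same transcript. The third pattern, a negated character class, is an assumed alternative that does not appear in the question or the answers. The variable names are illustrative only.

```python
import re

transcript = ('>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. '
              '>> Thank you for coming, we will start in two minutes. '
              '>> Sam Smith: [no audio] Good morning, everyone.')

# Original pattern: "." may cross ">>", so the second match starts at the
# ">>" before "Hello" and runs on until the next colon.
original = re.findall(r'>>(.*?):', transcript)

# Same character class as the answer, with a plain ":" instead of the
# lookahead: "," and "." are not in the class, so turns without a speaker
# fail before a colon is reached.
by_class = re.findall(r'>> ([a-zA-Z ]+):', transcript)

# Assumed alternative: exclude only ":" and ">" from the name.
by_negation = re.findall(r'>>\s*([^:>]+):', transcript)

print(original[0])   # " John doe" (with a leading space)
print(by_class)      # ['John doe', 'Sam Smith']
print(by_negation)   # ['John doe', 'Sam Smith']
```

Note that [a-zA-Z ] rejects names containing apostrophes, hyphens, or digits. The class would need extending if such names occur in the transcripts.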

Upvotes: 2

cs95

Reputation: 402563

Do you really need regex? You can split on the >> prompts, keep only the pieces that contain a colon, and take the text before the colon as the name.

>>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
['John doe', 'Sam Smith']
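The same idea wrapped in a function, as a rough sketch. The name speaker_names is hypothetical, and the second call shows a limitation worth knowing: any turn whose text contains a colon is treated as having a speaker.

```python
def speaker_names(transcript):
    """Return the text before the first colon of each '>>' turn that has one."""
    return [turn.split(':', 1)[0].strip()
            for turn in transcript.split('>>')
            if ':' in turn]

names = speaker_names('>> John doe: Hello. >> Hello, I am Jane Doe. >> Sam Smith: Hi.')

# A turn with no speaker but a colon in its text is picked up by mistake.
false_positive = speaker_names('>> We start at 10:30.')

print(names)           # ['John doe', 'Sam Smith']
print(false_positive)  # ['We start at 10']
```

If the transcripts can contain times or other colons in the spoken text, the regex with a restricted character class is the safer choice.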

Upvotes: 4
