user1995

Reputation: 538

%s showing strange behavior in regex

I have a string in which I want to find the words preceding a parenthesis. Let's say the string is -

'there are many people in the world having colorectal cancer (crc) who also have the depression syndrome (ds)'

I want to capture at most 5 words before each parenthesis. I have a list, acronym_list, of the abbreviations that appear inside the parentheses - [(crc), (ds)]. So I am using the following code -

acrolen = 5
rt = []
for acro in acronym_list:
    find_words = re.findall('((?:\w+\W+){1,%d}%s)' % (acrolen, acro), text, re.I)
    for word in find_words:
        rt.append(word)
print(rt)

But this gives this result -

('the world having colorectal cancer (crc', 'crc')
('also have the depression syndrome (ds', 'ds')

Whereas if I use the regex -

find_words = re.findall('((?:\w+\W+){1,%d}\(crc\))' % (acrolen), text, re.I)

Then it is able to find exactly what I want i.e. -

the world having colorectal cancer (crc)

The question is - why does using %s for the string here cause the regex match to be so vastly different (unnecessary brackets around the result, the acronym repeated, etc.)?

How can I use the first regex properly, so that I can automate the process with a loop rather than having to type the exact string into the regex every time?

Upvotes: 1

Views: 56

Answers (1)

Wiktor Stribiżew

Reputation: 626747

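The parentheses in (crc) are regex metacharacters. Inserted into the pattern as-is, they form a second capturing group that matches only crc, so the literal parentheses are never matched, and re.findall returns one tuple per match with an item for each group. You need to make sure the variables you pass are escaped correctly so that they are treated as literal text inside the regex pattern. Use re.escape(acro):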

import re
text = "there are many people in the world having colorectal cancer (crc) who also have the depression syndrome (ds)"
acrolen = 5
rt = []
acronym_list = ["(crc)", "(ds)"]
for acro in acronym_list:
    # re.escape turns "(crc)" into "\(crc\)" so the parentheses match literally
    p = r'((?:\w+\W+){1,%d}%s)' % (acrolen, re.escape(acro))
    # Or, use format (literal braces must be doubled):
    # p = r'((?:\w+\W+){{1,{0}}}{1})'.format(acrolen, re.escape(acro))
    find_words = re.findall(p, text, re.I)
    for word in find_words:
        rt.append(word)
print(rt)

See the Python demo
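
To see where the tuples in the question come from, here is a small sketch that builds the pattern both ways for one acronym and compares the results. The variable names (unescaped, escaped, and the two *_matches lists) are only illustrative and are not part of the original code.

```python
import re

text = ("there are many people in the world having colorectal cancer (crc) "
        "who also have the depression syndrome (ds)")

# Unescaped: the parentheses of "(crc)" become a second capturing group,
# so the pattern looks for "crc" with no literal parentheses around it.
unescaped = r'((?:\w+\W+){1,5}%s)' % "(crc)"

# Escaped: re.escape backslash-escapes the parentheses, so they match literally.
escaped = r'((?:\w+\W+){1,5}%s)' % re.escape("(crc)")

print(unescaped)  # shows the pattern with two capturing groups
print(escaped)    # shows the pattern with \( and \)

# Two groups -> findall returns a list of tuples, one item per group.
unescaped_matches = re.findall(unescaped, text, re.I)
# One group -> findall returns a list of strings.
escaped_matches = re.findall(escaped, text, re.I)

print(unescaped_matches)
print(escaped_matches)
```

The first result is a list of 2-tuples, because the pattern contains two capturing groups; the second is a list of plain strings that end with the literal (crc).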

Also, note that you do not need to enclose the whole pattern in a capturing group: when the pattern defines no capturing group, re.findall returns the full text of each match.
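
As a sketch of that simplification, the loop could look like this with the outer parentheses dropped. It uses the same text and acronym_list as above; using rt.extend in place of the inner loop is my own shortcut, not something the original code does.

```python
import re

text = ("there are many people in the world having colorectal cancer (crc) "
        "who also have the depression syndrome (ds)")
acronym_list = ["(crc)", "(ds)"]
acrolen = 5

rt = []
for acro in acronym_list:
    # Only a non-capturing group remains, so findall returns whole matches.
    p = r'(?:\w+\W+){1,%d}%s' % (acrolen, re.escape(acro))
    rt.extend(re.findall(p, text, re.I))

print(rt)
```

Each entry in rt is then a plain string holding up to five words followed by the acronym in parentheses.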

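It is also recommended to use raw string literals (r'...') when defining regex patterns, so that backslash sequences such as \w and \( reach the regex engine unchanged instead of being processed as string escapes first.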
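
One case where the difference is visible is \b, which is a backspace character in an ordinary string literal but a word boundary in a regex. The following is a minimal illustration with made-up variable names (plain, raw, sample), not code from the question:

```python
import re

sample = "colorectal cancer (crc)"

# Ordinary literal: Python converts each \b to a backspace character (\x08)
# before the regex engine ever sees the pattern.
plain = '\bcancer\b'

# Raw literal: the backslashes are kept, so the regex engine sees \b
# and treats it as a word boundary.
raw = r'\bcancer\b'

plain_matches = re.findall(plain, sample)
raw_matches = re.findall(raw, sample)

print(len(plain), len(raw))
print(plain_matches)
print(raw_matches)
```

The ordinary literal finds nothing because the sample text contains no backspace characters, while the raw literal matches the word cancer.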

Upvotes: 1
