Reputation: 18929
I am developing an application where I need to search and substitute strings in a body of text.
I came across this SO post and have been using the 3rd answer as the basis of my function.
My code looks like:
subs_dict = {
    "INT.": "Internet",
    ...
}

def substitutions(self, text):
    return re.sub(
        r'\b' + '|'.join(subs_dict.keys())
        + r'\b', lambda m: subs_dict[m.group(0)],
        text
    )
However, this gets tripped up by text such as "The INTREPID explorer", which fails with a KeyError for INTR.
The problem is that in the pattern, "INT." is interpreted as "INT" followed by any character, because the period is a special character in regular expressions.
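A minimal reproduction of the failure, as a sketch: the second key "EXT." is a hypothetical entry that is not in the original dict, added because the error only shows up with more than one key. The `|` operator has lower precedence than the surrounding `\b`, so the trailing `\b` applies only to the last alternative, and the first alternative becomes `\bINT.` with nothing after it.

```python
import re

# "EXT." is a hypothetical second key, added only for this reproduction.
subs_dict = {"INT.": "Internet", "EXT.": "External"}

# Same construction as in the question: keys are joined without escaping.
pattern = r'\b' + '|'.join(subs_dict.keys()) + r'\b'

match = re.search(pattern, "The INTREPID explorer")
matched_text = match.group(0)             # the unescaped "." consumed the "R"
is_known_key = matched_text in subs_dict  # False, so subs_dict[...] raises KeyError

print(pattern, repr(matched_text), is_known_key)
```

The matched text is not a dict key, which is why the lambda's lookup fails.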
I have temporarily fixed the issue using this modified code:
def substitutions(text):
    return re.sub(
        r'\b' + '|'.join(subs_dict.keys()).replace('.', '[.]')
        + r'\b', lambda m: subs_dict[m.group(0)],
        text
    )
This allows the period to be matched literally while keeping the dict keys intact (as opposed to using "INT[.]" as the key, which would fail).
However, this has a bad smell to it and of course only takes care of the period, and not any other special characters.
So my question is whether there is a better way that works and treats all special characters literally.
Upvotes: 2
Views: 37
Reputation: 239573
The cleaner way would be to escape the actual strings with re.escape before you join them, like this:
r'\b' + '|'.join(map(re.escape, subs_dict)) + r'\b'
For example,
>>> import re
>>> subs_dict = {"INT.": "Internet"}
>>> def substitutions(text):
...     return re.sub(r'\b' + '|'.join(map(re.escape, subs_dict)) + r'\b',
...                   lambda m: subs_dict[m.group(0)], text)
...
>>> substitutions("The INTREPID explorer")
'The INTREPID explorer'
>>> substitutions("The INT.EPID explorer")
'The InternetEPID explorer'
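One possible refinement, offered as a sketch and not as part of the answer above: with several keys, the alternation can be wrapped in a non-capturing group so the boundary checks apply to every alternative. Also, `\b` after an escaped period only matches when a word character follows, so "INT. " followed by a space would not be replaced. The version below swaps `\b` for the lookarounds `(?<!\w)` and `(?!\w)`, and sorts keys longest first. The keys "EXT." and "INT" are hypothetical and exist only to illustrate those cases.

```python
import re

# "EXT." and "INT" are hypothetical extra keys, added only for illustration.
subs_dict = {"INT.": "Internet", "EXT.": "External", "INT": "Integer"}

# Longest keys first, so "INT." is tried before its prefix "INT".
keys = sorted(subs_dict, key=len, reverse=True)

# (?<!\w) and (?!\w) reject a word character on either side of the match,
# which also works when the key ends in punctuation; the (?:...) group
# makes both checks apply to every alternative, not just the first and last.
pattern = re.compile(
    r'(?<!\w)(?:' + '|'.join(map(re.escape, keys)) + r')(?!\w)'
)

def substitutions(text):
    # Every match is an exact key, so the lookup cannot raise KeyError.
    return pattern.sub(lambda m: subs_dict[m.group(0)], text)

print(substitutions("The INTREPID explorer"))
print(substitutions("Use INT. or EXT. here"))
```

Note that this changes the behaviour for "The INT.EPID explorer": the lookahead rejects "INT." directly followed by a letter, so whether that is desirable depends on the text being processed.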
Upvotes: 2