Python regex substitution function to evaluate literal characters

Question

I am developing an application where I need to search and substitute strings in a body of text.

I came across this SO post and have been using the 3rd answer as the basis of my function.

My code looks like:

subs_dict = {
    "INT.": "Internet",
    ...
}

def substitutions(text):
    return re.sub(
        r'\b' + '|'.join(subs_dict.keys())
        + r'\b', lambda m: subs_dict[m.group(0)],
        text
    )

However, this gets tripped up by text such as "The INTREPID explorer", which fails with a KeyError for "INTR".

The problem is that the period in "INT." is a regex metacharacter that matches any character, so the key is interpreted as "INT" followed by any character.
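
A minimal sketch of that behaviour, using a made-up sample string (the variable names are only for illustration): an unescaped period matches whatever character follows "INT", while the output of re.escape matches only the literal text.

```python
import re

sample = "INTREPID INT. INT5"  # hypothetical sample text

# An unescaped "." matches any character (except a newline by default).
unescaped = re.findall(r"INT.", sample)
print(unescaped)  # ['INTR', 'INT.', 'INT5']

# re.escape backslash-escapes the period so it is matched literally.
print(re.escape("INT."))  # INT\.
escaped = re.findall(re.escape("INT."), sample)
print(escaped)  # ['INT.']
```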

I have temporarily fixed the issue using this modified code:

def substitutions(text):
    return re.sub(
        r'\b' + '|'.join(subs_dict.keys()).replace('.', '[.]')
        + r'\b', lambda m: subs_dict[m.group(0)],
        text
    )

This allows the period to be evaluated literally while maintaining the integrity of the dict keys (as opposed to using "INT[.]" as the key, which will fail).

However, this has a bad smell to it and, of course, only takes care of the period and not any other special characters.

So, I guess my question is whether there is a better way that works and evaluates any special characters literally.

thefourtheye · Accepted Answer

The cleaner way would be to escape the actual strings with re.escape before you join them, like this:

r'\b' + '|'.join(map(re.escape, subs_dict)) + r'\b'

For example,

>>> import re
>>> subs_dict = {"INT.": "Internet"}
>>> def substitutions(text):
...     return re.sub(r'\b' + '|'.join(map(re.escape, subs_dict)) + r'\b',
...                   lambda m: subs_dict[m.group(0)], text)
... 
>>> substitutions("The INTREPID explorer")
'The INTREPID explorer'
>>> substitutions("The INT.EPID explorer")
'The InternetEPID explorer'
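
Two caveats may be worth keeping in mind with the pattern above. In a pattern of the form \bA|B\b, alternation binds more loosely than \b, so with several keys the leading \b applies only to the first key and the trailing \b only to the last. Also, \b after a key that ends in a period matches only when a word character follows, which is why "INT.EPID" is replaced above, whereas "INT. " followed by a space would not be. The following is a sketch of one possible variation, not part of the original answer: it wraps the keys in a non-capturing group and swaps \b for lookarounds. The extra "e.g." key and the longest-first ordering are assumptions added for illustration.

```python
import re

# Hypothetical substitution table; "e.g." is added only for illustration.
subs_dict = {"INT.": "Internet", "e.g.": "for example"}

# Try longer keys first so a longer key wins over a shorter one
# that shares its prefix.
keys = sorted(subs_dict, key=len, reverse=True)

# (?<!\w) and (?!\w) require that no word character touches the match;
# the (?:...) group makes both conditions apply to every alternative.
pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(map(re.escape, keys)) + r")(?!\w)"
)

def substitutions(text):
    # Replace each matched key with its value from subs_dict.
    return pattern.sub(lambda m: subs_dict[m.group(0)], text)

print(substitutions("The INT. is down"))       # The Internet is down
print(substitutions("The INTREPID explorer"))  # The INTREPID explorer
print(substitutions("The INT.EPID explorer"))  # The INT.EPID explorer
```

Note that this variant deliberately leaves "INT.EPID" alone, which differs from the behaviour shown in the answer; whether that is desirable depends on the text being processed.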
