Reputation: 23
I'm working on cleaning some text that contains a lot of acronyms. I have made a dictionary that maps a few example acronyms to their expansions, but I am running into a problem with it. Example code below:
def acr(text):
    acr_dict = {'ft': 'feet',
                'mi': 'michigan'}
    for word, abr in acr_dict.items():
        text = text.replace(word.lower(), abr)
    return text
The logic works, but if the letters of an acronym also appear inside another word, it does the following:
ex: print(acr('I like milk and live in mi'))
output --> I like michiganlk and live in michigan
Any advice on how to not have it look for the acronym letters within other words?
Upvotes: 2
Views: 1007
Reputation: 4957
One potential solution (assuming you have trivial white space) could be to split the string into words, compare each one against the dictionary, and replace it if it matches.
example = "my name is michael and i was born in mi and am 6 ft"

def acr(text):
    acr_dict = {
        'ft': 'feet',
        'mi': 'michigan'
    }
    text_words = text.split()
    for i, word in enumerate(text_words):
        # Look up the lowercased word so that e.g. 'MI' does not raise a KeyError
        if word.lower() in acr_dict:
            text_words[i] = acr_dict[word.lower()]
    return ' '.join(text_words)

print(acr(example))
# my name is michael and i was born in michigan and am 6 feet
And if you did have non-trivial white space and were okay using regular expressions, you could do this, which should preserve the specific white space characters:
import re

def acr(text):
    acr_dict = {
        'ft': 'feet',
        'mi': 'michigan'
    }
    for k, v in acr_dict.items():
        # (\A|\s) also lets an acronym match at the very start of the text
        text = re.sub(rf"(\A|\s){k.lower()}(\s|\Z)", rf"\1{v}\2", text)
    return text
If you were worried about performance, you could try compiling the regex for each acronym beforehand.
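As a rough sketch of what compiling beforehand could look like: the name `COMPILED` is made up for illustration, and the pattern here swaps the capture groups for lookarounds (`(?<!\S)` and `(?!\S)`), so the surrounding white space is never consumed and needs no back-references. Python's `re` module already caches recently used patterns, so the gain may be modest for a small dictionary.

```python
import re

ACR_DICT = {'ft': 'feet', 'mi': 'michigan'}

# Build one compiled pattern per acronym, once, at import time.
# (?<!\S) means "not preceded by a non-whitespace character" and
# (?!\S) means "not followed by one", so the acronym must stand alone
# between white space or the ends of the string.
# COMPILED is a hypothetical name, not something from the answer above.
COMPILED = [
    (re.compile(rf"(?<!\S){re.escape(k)}(?!\S)"), v)
    for k, v in ACR_DICT.items()
]

def acr(text):
    # Reuse the precompiled patterns on every call.
    for pattern, expansion in COMPILED:
        text = pattern.sub(expansion, text)
    return text

print(acr("I like milk and live in mi"))
# I like milk and live in michigan
```

Because the lookarounds match no characters themselves, tabs, newlines and repeated spaces around an acronym come through unchanged.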
Upvotes: 1
Reputation: 488
The simplest solution is, as others have stated, to use regexes.
import re

ACR_DICT = {'ft': 'feet', 'mi': 'michigan'}

def acr(text):
    for k, v in ACR_DICT.items():
        text = re.sub(rf'\b{k}\b', v, text)
    return text
acr('I might be 6 ft tall. I often left my home state of mi at 3 years old.')
# 'I might be 6 feet tall. I often left my home state of michigan at 3 years old.'
Note the usage of the word-boundary metacharacter '\b'. This will ensure that the regex doesn't find matches inside words like 'often' or 'might'.
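If the dictionary grows, one possible variation on the same word-boundary idea is to join all acronyms into a single alternation and replace them in one pass. This is only a sketch: `PATTERN`, the `re.IGNORECASE` flag and the lowercase lookup are additions for illustration, and `re.escape` is there in case an acronym ever contains a regex metacharacter.

```python
import re

ACR_DICT = {'ft': 'feet', 'mi': 'michigan'}

# One pattern of the form \b(ft|mi)\b, built from the dictionary keys.
# PATTERN is a hypothetical name; IGNORECASE is an assumption about
# wanting 'MI' and 'mi' treated alike.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ACR_DICT) + r")\b",
    re.IGNORECASE,
)

def acr(text):
    # The callback receives each match and looks up its lowercased
    # form, so the whole text is scanned only once.
    return PATTERN.sub(lambda m: ACR_DICT[m.group(1).lower()], text)

print(acr("I often drive 6 MI, then walk 300 ft."))
# I often drive 6 michigan, then walk 300 feet.
```

The word boundaries still do the real work here: 'often' and 'might' are left alone because the acronym letters sit next to other word characters.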
Upvotes: 3