Reputation: 2172
I am trying to normalize a string by replacing the abbreviations used in it with the phrases they stand for. I have a list of such abbreviations in a Python dictionary named "dict". For example:
print(dict['gf'])
would result in:
girlfriend
Now, my question is: since there are around 300 keys in this dictionary, I need a fast way to check whether any of these keys appear in a given string. My initial thought was to use the regular expression below and then somehow compare all the keys of the dictionary against all the words in a given string (which I have named "text" in the code below), but I noticed that I can't place a variable in the middle of a string.
import re
text = "I have a gf"
print(re.sub(r'(?<![a-zA-Z])(gf)(?![a-zA-Z])', 'girlfriend', text))
This would print:
I have a girlfriend
But as you may have noticed, I can't apply this method to the case described above. Can anyone help me with this? Thanks in advance!
Upvotes: 2
Views: 618
Reputation: 375574
Here's a way to construct a regex to match all the words at once:
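import re

words = {
    'gf': 'girlfriend',
    'bf': 'boyfriend',
    'btw': 'by the way',
    'hi': 'hello',
}
# Join all keys into one alternation, anchored on word boundaries.
pat = re.compile(r"\b(%s)\b" % "|".join(words))
text = "The gf and the bf say hi btw."
# Look up each matched abbreviation in the dictionary.
new_text = pat.sub(lambda m: words.get(m.group()), text)
print(new_text)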
Prints:
The girlfriend and the boyfriend say hello by the way.
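If some of the 300 keys contain characters that are special in a regex (for example "w/"), the pattern above could misbehave. A minimal sketch of a more defensive variant follows, assuming the dictionary is small enough to fit in a single alternation. The names build_pattern and expand_abbreviations are hypothetical, and the "w/" entry is only an illustration. It escapes each key with re.escape and uses lookarounds instead of \b, so keys that end in punctuation can still match:

```python
import re

def build_pattern(abbrevs):
    # Escape each key so characters such as "/" or "." are matched literally.
    alternatives = "|".join(re.escape(key) for key in abbrevs)
    # Lookarounds instead of \b: a key may not touch a word character on
    # either side, even when the key itself ends in punctuation.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives)

def expand_abbreviations(text, abbrevs):
    # An empty dictionary would produce a pattern that matches everywhere,
    # so return the text unchanged in that case.
    if not abbrevs:
        return text
    pattern = build_pattern(abbrevs)
    return pattern.sub(lambda m: abbrevs[m.group()], text)

# Hypothetical sample data for illustration only.
words = {"gf": "girlfriend", "w/": "with", "btw": "by the way"}
print(expand_abbreviations("Coffee w/ my gf btw.", words))
```

Compiling the pattern once and reusing it for many strings would avoid rebuilding the alternation on every call.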
Upvotes: 2
Reputation: 76715
You can use the .get() method on the dictionary to look up an abbreviation. The default value returned by .get() is None, but you can provide an argument to be used when the lookup fails. So .get(s, s) looks up s in the dictionary, and returns s unchanged if it wasn't in the dictionary, or returns the dictionary value if it was.
Then just split the string, look up each word, and rejoin.
abbrevs = {"gf": "girlfriend", "cul": "see you later"}

def lookup(s):
    # Return the expansion if s is an abbreviation, otherwise s unchanged.
    return abbrevs.get(s, s)

def expand(s_text):
    return ' '.join(lookup(s) for s in s_text.split())

print(expand("My gf just called. cul"))
The above only splits words on white space, and replaces all white space with a single space. You could write a regular expression that matches white space and/or punctuation and use that to make a more clever splitting function, and you could save the matched white space to make it not replace all white space with a single space. But I wanted to keep the example simple.
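As a rough sketch of that idea, re.split with a capturing group keeps the separators, so the original spacing and punctuation survive the round trip. The function name expand_keep_spacing is hypothetical, and treating every run of non-word characters (\W+) as a separator is an assumption that may not suit abbreviations containing punctuation:

```python
import re

abbrevs = {"gf": "girlfriend", "cul": "see you later"}

def expand_keep_spacing(s_text):
    # The capturing group makes re.split return the separators too,
    # so words and the text between them alternate in the result.
    pieces = re.split(r"(\W+)", s_text)
    # Separators are not dictionary keys, so .get(piece, piece)
    # passes them through unchanged.
    return "".join(abbrevs.get(piece, piece) for piece in pieces)

print(expand_keep_spacing("My gf  just called. cul"))
```

Unlike the whitespace-only split, this also expands an abbreviation that is directly followed by punctuation, such as "cul!".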
Upvotes: 2