Reputation: 937
I'm trying to remove the dots from a list of abbreviations so that they will not confuse the sentence tokenizer. This should be very straightforward, but I don't know why my code is not working.
Here is my code:
import re

abbrevs = [
    "No.", "U.S.", "Mses.", "B.S.", "B.A.", "D.C.", "B.Tech.", "Pte.", "Mr.", "O.E.M.",
    "I.R.S", "sq.", "Reg.", "S-K."
]

def replace_abbrev(abbrs, text):
    re_abbrs = [r"\b" + re.escape(a) + r"\b" for a in abbrs]
    abbr_no_dot = [a.replace(".", "") for a in abbrs]
    pattern_zip = zip(re_abbrs, abbr_no_dot)
    for p in pattern_zip:
        text = re.sub(p[0], p[1], text)
    return text

text = "Test No. U.S. Mses. B.S. Test"
text = replace_abbrev(abbrevs, text)
print(text)
Here is the result. Nothing was replaced. What is wrong? Thanks.
Test No. U.S. Mses. B.S. Test
Upvotes: 2
Views: 302
Reputation: 27515
You could use map and operator.methodcaller. There is no need for re, even though it's a great library.
from operator import methodcaller
' '.join(map(methodcaller('replace', '.', ''), abbrevs))
#No US Mses BS BA DC BTech Pte Mr OEM IRS sq Reg S-K
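The snippet above strips the dots from the abbreviation list itself. To apply the same no-regex idea to the sentence, a minimal sketch with plain str.replace might look like the following. The helper name strip_abbrev_dots is hypothetical, and the abbreviation list is shortened for the example. Note that, without a word boundary, this would also replace an abbreviation that happens to sit at the end of a longer word.

```python
abbrevs = ["No.", "U.S.", "Mses.", "B.S."]

def strip_abbrev_dots(abbrs, text):
    # Hypothetical helper: replace each abbreviation with its dot-free form.
    # Plain substring replacement, so no word-boundary check is performed.
    for a in abbrs:
        text = text.replace(a, a.replace(".", ""))
    return text

result = strip_abbrev_dots(abbrevs, "Test No. U.S. Mses. B.S. Test")
print(result)
```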
Upvotes: 1
Reputation: 67988
re_abbrs = [r"\b" + re.escape(a) for a in abbrs]
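You need to use this instead. There is no \b (word boundary) after ".", because both "." and the space that follows it are non-word characters, so the trailing \b in your pattern never matches. With the trailing \b removed, this gives the correct output: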
Test No US Mses BS Test
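As a rough sketch of the same fix, the abbreviations could also be combined into one compiled pattern so the text is scanned only once. This is an assumption about how one might structure it, not part of the original code. The two re.search calls at the top only illustrate why the trailing \b fails.

```python
import re

abbrevs = ["No.", "U.S.", "Mses.", "B.S.", "I.R.S", "S-K."]

# A trailing \b needs a word character on one side; "." followed by " " has none.
trailing = re.search(r"\bNo\.\b", "Test No. Test")  # no match
leading = re.search(r"\bNo\.", "Test No. Test")     # matches "No."

def replace_abbrev(abbrs, text):
    # Try longer abbreviations first so a shorter one cannot pre-empt them.
    alternatives = sorted(abbrs, key=len, reverse=True)
    # Leading \b only, as in the fix above; one alternation for all abbreviations.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")")
    # Replace each match with the same text minus its dots.
    return pattern.sub(lambda m: m.group(0).replace(".", ""), text)

result = replace_abbrev(abbrevs, "Test No. U.S. Mses. B.S. Test")
print(result)
```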
Upvotes: 3