Victor Wang

Reputation: 937

Replacing the dots for a list of abbreviations?

I'm trying to remove the dots from a list of abbreviations so that they don't confuse the sentence tokenizer. This should be very straightforward, but I don't know why my code is not working.

Below please find my code:

import re

abbrevs = [
    "No.", "U.S.", "Mses.", "B.S.", "B.A.", "D.C.", "B.Tech.", "Pte.", "Mr.", "O.E.M.",
    "I.R.S", "sq.", "Reg.", "S-K."
]



def replace_abbrev(abbrs, text):
    re_abbrs = [r"\b" + re.escape(a) + r"\b" for a in abbrs]

    abbr_no_dot = [a.replace(".", "") for a in abbrs]

    pattern_zip = zip(re_abbrs, abbr_no_dot)

    for p in pattern_zip:
        text = re.sub(p[0], p[1], text)

    return text

text = "Test No. U.S. Mses. B.S. Test"

text = replace_abbrev(abbrevs, text)

print(text)

Here is the result. Nothing was replaced. What is wrong? Thanks.

 Test No. U.S. Mses. B.S. Test

Upvotes: 2

Views: 302

Answers (2)

Jab

Reputation: 27515

To strip the dots from the abbreviations themselves, you could use map and operator.methodcaller. There is no need for re here, even though it's a great library.

from operator import methodcaller

' '.join(map(methodcaller('replace', '.', ''), abbrevs))
#No US Mses BS BA DC BTech Pte Mr OEM IRS sq Reg S-K
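Note that the snippet above strips the dots from the abbreviation list itself. To apply the same idea to the text from the question without re, a minimal sketch using plain str.replace could look like this. The helper name strip_abbrev_dots is hypothetical, and unlike a regex version it does no word-boundary check, so it would also replace an abbreviation that happens to appear at the end of a longer word.

```python
def strip_abbrev_dots(abbrs, text):
    # Replace each dotted abbreviation found in the text with its dot-free form.
    # Plain substring replacement: no word-boundary check is performed.
    for abbr in abbrs:
        text = text.replace(abbr, abbr.replace(".", ""))
    return text

abbrevs = ["No.", "U.S.", "Mses.", "B.S."]
print(strip_abbrev_dots(abbrevs, "Test No. U.S. Mses. B.S. Test"))
# Test No US Mses BS Test
```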

Upvotes: 1

vks

Reputation: 67988

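re_abbrs = [r"\b" + re.escape(a) for a in abbrs]

Use this pattern instead, without the trailing \b. A \b matches only between a word character and a non-word character. The final dot of each abbreviation is followed by a space, and both are non-word characters, so there is no boundary there and your original pattern never matched. With this change the output is correct: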

Test No US Mses BS Test
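One caveat: with no right-hand check at all, the pattern for "U.S." would also match the start of "U.S.A." and turn it into "USA.". A minimal sketch of one way around that, assuming a negative lookahead (?!\w) is acceptable as the right-hand boundary. The lookahead and the loop layout are my additions, not part of the original code.

```python
import re

def replace_abbrev(abbrs, text):
    for abbr in abbrs:
        # \b anchors the start of the abbreviation. (?!\w) stands in for the
        # trailing \b: it only requires that no word character follows, so it
        # still matches when the final dot is followed by a space or the end.
        pattern = r"\b" + re.escape(abbr) + r"(?!\w)"
        text = re.sub(pattern, abbr.replace(".", ""), text)
    return text

abbrevs = ["No.", "U.S.", "Mses.", "B.S."]
print(replace_abbrev(abbrevs, "Test No. U.S. Mses. B.S. Test"))
# Test No US Mses BS Test
```

With this version, "U.S.A." is left alone unless it is in the list itself.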

Upvotes: 3
