ebuzz168

Reputation: 1194

Dealing with abbreviations and misspelled words in a pandas DataFrame

I have a DataFrame that contains misspelled words and abbreviations, like this:

input:
import pandas as pd
df = pd.DataFrame(['swtch', 'cola', 'FBI', 
      'smsng', 'BCA', 'MIB'], columns=['misspelled'])

output:
       misspelled
0   swtch
1   cola
2   FBI
3   smsng
4   BCA
5   MIB

I need to correct both the misspelled words and the abbreviations.

I tried creating a dictionary of the correct words:

input: 
dicts = pd.DataFrame(['coca cola', 'Federal Bureau of Investigation', 
                    'samsung', 'Bank Central Asia', 'switch', 'Men In Black'], columns=['words'])

output:
        words
0   coca cola
1   Federal Bureau of Investigation
2   samsung
3   Bank Central Asia
4   switch
5   Men In Black 

and then applied this code:

import difflib
import numpy as np
x = [next(iter(x), np.nan) for x in map(lambda x: difflib.get_close_matches(x, dicts.words), df.misspelled)]
df['fix'] = x

print (df)

This corrects the misspelled words but not the abbreviations:

  misspelled        fix
0      swtch     switch
1       cola  coca cola
2        FBI        NaN
3      smsng    samsung
4        BCA        NaN
5        MIB        NaN

Please help.

Upvotes: 3

Views: 2087

Answers (1)

Code Different

Reputation: 93161

How about a two-pronged approach: first correct the misspellings, then expand the abbreviations?

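import pandas as pd
from spellchecker import SpellChecker  # provided by the pyspellchecker package

df = pd.DataFrame(['swtch', 'cola', 'FBI', 'smsng', 'BCA', 'MIB'], columns=['misspelled'])
abbreviations = {
    'FBI': 'Federal Bureau of Investigation',
    'BCA': 'Bank Central Asia',
    'MIB': 'Men In Black',
    'cola': 'Coca Cola'
}

spell = SpellChecker()
# Spell-correct each entry first, then swap exact dictionary keys for their expansions
df['fixed'] = df['misspelled'].apply(spell.correction).replace(abbreviations)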

Result:

  misspelled                            fixed
0      swtch                           switch
1       cola                        Coca Cola
2        FBI  Federal Bureau of Investigation
3      smsng                            among
4        BCA                Bank Central Asia
5        MIB                     Men In Black

I used pyspellchecker, but any spell-checking library will do. It corrected smsng to among, which is a caveat of automatic spelling correction. Different libraries may give different results.
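
If a general-purpose spell checker picks the wrong word (as with smsng above), one alternative is to keep the same two steps but match against your own word list instead. The following is only a minimal sketch, not the approach used above: it swaps the spell checker for the difflib matching from the question, and the names ABBREVIATIONS, VOCABULARY and fix_term are hypothetical. The abbreviations need the exact lookup because they share too few characters with their expansions to clear difflib's default 0.6 cutoff.

```python
import difflib

# Hypothetical lookup tables; in practice these would come from your own domain data.
ABBREVIATIONS = {
    "FBI": "Federal Bureau of Investigation",
    "BCA": "Bank Central Asia",
    "MIB": "Men In Black",
}
VOCABULARY = ["coca cola", "samsung", "switch"]


def fix_term(term, cutoff=0.6):
    """Expand a known abbreviation, else return the closest vocabulary entry."""
    # Step 1: exact dictionary lookup for abbreviations.
    if term in ABBREVIATIONS:
        return ABBREVIATIONS[term]
    # Step 2: fuzzy match against the domain vocabulary.
    matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=cutoff)
    # Leave the term unchanged when nothing is similar enough.
    return matches[0] if matches else term


terms = ["swtch", "cola", "FBI", "smsng", "BCA", "MIB"]
fixed = [fix_term(t) for t in terms]
```

On the DataFrame from the question, this could be applied with df['fixed'] = df['misspelled'].map(fix_term). The trade-off is that only words present in your own vocabulary can be corrected.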

Upvotes: 2
