shantanuo

Reputation: 32316

Correct spellings using ML

I have a column with city names that may contain spelling mistakes.

try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

myst="""1 Mumbai
2 Delhi
3 Delhi
4 Mumbai
5 Mumbai
6 Delhi
7 Dolhi
"""
import pandas as pd

u_cols = ['customer_id', 'city']
df = pd.read_csv(StringIO(myst), sep=' ', names=u_cols)

When I group by city, I can see the wrong spelling...

df['city'].value_counts()
Delhi     3
Mumbai    3
Dolhi     1

I can easily correct it using the replace method like this...

df.replace({'Dolhi': 'Delhi'})['city'].value_counts()

But there is no way to know the possible spelling mistakes in advance. If there are two similar words, the one with the higher count should be considered correct. So in this case, "Delhi" is correct and "Dolhi" is wrong. Is it possible to apply any such "intelligence" to this dataframe data?

Upvotes: 0

Views: 151

Answers (1)

hbaderts

Reputation: 14316

You can calculate the Levenshtein distance between each pair of city names, which is a measure of the similarity of two strings. Each edit needed to turn one string into the other (changing a character, or adding/removing one) increases the Levenshtein distance by one. The nltk package provides an edit_distance() function which returns the Levenshtein distance.
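
For intuition (or if installing nltk is not an option), the metric itself is only a short dynamic-programming routine. A minimal pure-Python sketch that computes the same distance as nltk's edit_distance() with default settings:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance: prev holds the
    # distances for the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Delhi", "Dolhi"))   # 1 (one substituted character)
print(levenshtein("Delhi", "Mumbai"))  # 5 (almost entirely different)
```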

You can iterate through all unique pairs of city names using itertools.combinations(), and check whether the Levenshtein distance is below a certain magic threshold, e.g. 1 in this example. Note that this threshold requires some manual tuning so that different city names are not merged while all spelling errors are still caught.
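
To see what combinations() produces: it yields each unordered pair exactly once, so every city name is compared against every other name without duplicate or self-comparisons.

```python
import itertools

# The three unique spellings from the example dataframe.
names = ['Delhi', 'Mumbai', 'Dolhi']
for a, b in itertools.combinations(names, 2):
    print(a, b)
# Delhi Mumbai
# Delhi Dolhi
# Mumbai Dolhi
```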

Then you can find out the correct name (main_name below) and the wrong name (mistake below) by comparing how often each was used in the dataset. This of course assumes that people write the city name correctly most of the time.

import itertools
from nltk.metrics import edit_distance

MAGIC_THRESHOLD = 1

city_names = df['city'].value_counts()

for name_a, name_b in itertools.combinations(city_names.index, 2):
    # Names within the threshold are treated as the same city.
    if edit_distance(name_a, name_b) <= MAGIC_THRESHOLD:
        count_a, count_b = city_names[name_a], city_names[name_b]
        # The more frequent spelling wins; the rarer one is the mistake.
        main_name, mistake = (name_a, name_b) if count_a > count_b else (name_b, name_a)
        df = df.replace({mistake: main_name})

print(df['city'].value_counts())

which gives us as a result:

Delhi     4
Mumbai    3
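
As an aside, if you prefer to stay within the standard library, difflib's get_close_matches() gives a similar fuzzy grouping; here the cutoff parameter (a similarity ratio, not an edit count) plays the role of the magic threshold and likewise needs tuning. A sketch, with the counts dictionary mirroring the value_counts above:

```python
import difflib

counts = {'Delhi': 3, 'Mumbai': 3, 'Dolhi': 1}

canonical = []  # names accepted as correct spellings
mapping = {}    # misspelling -> correct name
# Visit the most frequent names first, so rare variants
# get mapped onto the common spellings already accepted.
for name in sorted(counts, key=counts.get, reverse=True):
    match = difflib.get_close_matches(name, canonical, n=1, cutoff=0.75)
    if match:
        mapping[name] = match[0]   # close to an accepted name: a mistake
    else:
        canonical.append(name)     # first of its kind: accept as correct

print(mapping)  # {'Dolhi': 'Delhi'}
```

df.replace(mapping) would then apply all corrections in one call.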

Upvotes: 2
