Reputation: 32316
I have a column with city names that may contain spelling mistakes.
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
myst="""1 Mumbai
2 Delhi
3 Delhi
4 Mumbai
5 Mumbai
6 Delhi
7 Dolhi
"""
u_cols=['customer_id', 'city']
myf = StringIO(myst)
import pandas as pd
df = pd.read_csv(StringIO(myst), sep=' ', names = u_cols)
When I group by city, I can see the wrong spelling...
df['city'].value_counts()
Delhi 3
Mumbai 3
Dolhi 1
I can easily correct it using replace method like this...
df.replace({'Dolhi': 'Delhi'})['city'].value_counts()
But there is no way to know the possible spelling mistakes. If there are 2 similar words, then the one with higher count is considered correct. So in this case, "Delhi" is correct and "Dolhi" is wrong. Is it possible to apply any such "intelligence" to this dataframe data?
Upvotes: 0
Views: 151
Reputation: 14316
You can calculate the Levenshtein distance between each pair of cities, which is a measure for the similarity of two strings. Basically, each change you have to make to convert a string into the other (e.g. changing characters, or adding/removing characters) will increase the Levenshtein distance by one. The nltk
package contains an edit_distance()
function which returns the Levenshtein distance.
You can iterate through all unique pairs of cities by using itertools.combinations()
, and check whether the Levenshtein distance is below a certain magic threshold, e.g. 1
in this example. Note that this threshold requires some manual tuning in order to not merge different city names while still catching all spelling errors.
Then you can find out the correct name (main_name
below) and the wrong name (mistake
below) by comparing how often they were used in the dataset. Of course under the assumption that people will write the city name correctly most of the time.
import itertools
from nltk.metrics import edit_distance
MAGIC_THRESHOLD = 1
city_names = df['city'].value_counts()
for name_a, name_b in itertools.combinations(city_names.index, 2):
if edit_distance(name_a, name_b) <= MAGIC_THRESHOLD:
count_a, count_b = city_names[name_a], city_names[name_b]
main_name, mistake = (name_a, name_b) if count_a > count_b else (name_b, name_a)
df = df.replace({mistake: main_name})
print(df['city'].value_counts())
which gives us as a result:
Delhi 4
Mumbai 3
Upvotes: 2