Reputation: 605
I currently have a csv file. The data originally is derived from PDF and doing a further analysis on the data, There is an [States] column where some the values are spelled wrong, i.e it contains wrong spelling or integers.
I need instead of numbers or the wrong spelling in the rows, It can parse First 3 characters or last 3 Characters and match it with the dictionary Firstly it needs to match with first 3 character if it is not matching then look for the last 3 character and gives the correct values to be saved into it.
Such as in the following example:
state_name
Assan
Andhra Prade5h
M1zoram
Uttar Pr8desh
Expected Output:
state_name
Assam
Andhra Pradesh
Mizoram
Uttar Pradesh
What I have tried so far:
dict = {'Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'}
##Didn't worked
df['state_name'] = df['state_name'].map(dict).fillna(df['state_name'])
##Then Tried Using the below method but not able to perform the further process
state1 = df['state_name'].str[3:]
state2 = df['state_name'].str[:3]
So I have no clue how to handle this problem.
Upvotes: 0
Views: 95
Reputation: 75120
here is one with product and Levenshtein , you can test to see how it performs on the original df:
import Levenshtein
d = {'Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'}
i = pd.MultiIndex.from_product((df['state_name'],d))
def f(x) : return Levenshtein.ratio(*x)
out = pd.DataFrame.from_records(i,columns=['Inp','Output']).assign(Score=i.map(f))
out = out.loc[out.groupby('Inp')['Score'].idxmax()]
print(out)
Inp Output Score
7 Andhra Prade5h Andhra Pradesh 0.928571
2 Assan Assam 0.800000
9 M1zoram Mizoram 0.857143
12 Uttar Pr8desh Uttar Pradesh 0.923077
Upvotes: 2
Reputation: 59579
You can use a partial_ratio
to find the closest match from your list of matches. Because this will be slow, you can first check for exact matches and then correct only the others. The output contains the original value, what was matched and then if it was a fuzzy match the match ratio, which you can use to determine the quality of the match.
import pandas as pd
from itertools import zip_longest
from fuzzywuzzy import fuzz
df = pd.DataFrame({'state_name': ['Assan', 'Andhra Prade5h', 'M1zoram', 'Uttar Pr8desh',
'Assam']})
matches = ['Assam', 'Andhra Pradesh', 'Mizoram', 'Uttar Pradesh']
# Exact Matches
df['match'] = df['state_name'].where(df['state_name'].isin(matches))
# Split to figure out what didn't match exactly
m = df['match'].notnull()
df_corr = df[m].copy() # These are fine, exact matches
df_inco = df[~m].copy() # Need to fix
# Only if something needs to be fixed.
if not df_inco.empty:
l = []
for idx, row in df_inco.iterrows():
z = zip_longest([row.state_name], matches, fillvalue=row.state_name)
s = pd.DataFrame([(x[1], fuzz.ratio(x[0], x[1])) for x in z],
columns=['match', 'fuzz_ratio'])
s = s.loc[[s['fuzz_ratio'].idxmax()]] # Best Match
s.index = [idx] # Set index so we can just combine first after the loop.
l.append(s)
# Add the best match based on the fuzzy ratio
df_inco = df_inco.combine_first(pd.concat(l))
df = pd.concat([df_inco, df_corr], sort=False).sort_index()
print(df)
fuzz_ratio match state_name
0 80.0 Assam Assan
1 93.0 Andhra Pradesh Andhra Prade5h
2 86.0 Mizoram M1zoram
3 92.0 Uttar Pradesh Uttar Pr8desh
4 NaN Assam Assam
Upvotes: 1
Reputation: 1093
Here is a solution. Basically it compares your input with your correct name list and return best match(the most similar):
from difflib import SequenceMatcher
lst = ['Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'] #Correct Name List
def closest(s):
highest = 0
result = ''
for i in lst:
temp = SequenceMatcher(None, s, i).ratio() #Similarity Ratio
if temp > highest:
highest = temp
result = i
return result
df = pd.DataFrame(['Assan','Andhra Prade5h','M1zoram','Uttar Pr8desh'], columns = ["state_name"])
#Output Before
state_name
0 Assan
1 Andhra Prade5h
2 M1zoram
3 Uttar Pr8desh
df['state_name'] = df['state_name'].apply(lambda x: closest(x))
#Output After
state_name
0 Assam
1 Andhra Pradesh
2 Mizoram
3 Uttar Pradesh
Upvotes: 2