Reputation: 605

How to correct the sates names using dictionary in python

I currently have a csv file. The data originally is derived from PDF and doing a further analysis on the data, There is an [States] column where some the values are spelled wrong, i.e it contains wrong spelling or integers.

I need instead of numbers or the wrong spelling in the rows, It can parse First 3 characters or last 3 Characters and match it with the dictionary Firstly it needs to match with first 3 character if it is not matching then look for the last 3 character and gives the correct values to be saved into it.

Such as in the following example:

state_name
Assan
Andhra Prade5h
M1zoram
Uttar Pr8desh

Expected Output:

state_name
Assam
Andhra Pradesh
Mizoram
Uttar Pradesh

What I have tried so far:

dict = {'Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'} 

##Didn't worked
df['state_name'] = df['state_name'].map(dict).fillna(df['state_name'])

##Then Tried Using the below method but not able to perform the further process

state1 = df['state_name'].str[3:]
state2 = df['state_name'].str[:3]

So I have no clue how to handle this problem.

Upvotes: 0

Answers (3)

anky

Reputation: 75120

here is one with product and Levenshtein , you can test to see how it performs on the original df:

import Levenshtein
d = {'Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'}

i = pd.MultiIndex.from_product((df['state_name'],d))
def f(x) : return Levenshtein.ratio(*x)
out = pd.DataFrame.from_records(i,columns=['Inp','Output']).assign(Score=i.map(f))
out = out.loc[out.groupby('Inp')['Score'].idxmax()]

print(out)
               Inp          Output     Score
7   Andhra Prade5h  Andhra Pradesh  0.928571
2            Assan           Assam  0.800000
9          M1zoram         Mizoram  0.857143
12   Uttar Pr8desh   Uttar Pradesh  0.923077

Upvotes: 2

ALollz

Reputation: 59579

You can use a partial_ratio to find the closest match from your list of matches. Because this will be slow, you can first check for exact matches and then correct only the others. The output contains the original value, what was matched and then if it was a fuzzy match the match ratio, which you can use to determine the quality of the match.

Setup

import pandas as pd
from itertools import zip_longest
from fuzzywuzzy import fuzz

df = pd.DataFrame({'state_name': ['Assan', 'Andhra Prade5h', 'M1zoram', 'Uttar Pr8desh',
                                  'Assam']})
matches = ['Assam', 'Andhra Pradesh', 'Mizoram', 'Uttar Pradesh']

Code

# Exact Matches
df['match'] = df['state_name'].where(df['state_name'].isin(matches))

# Split to figure out what didn't match exactly
m = df['match'].notnull()
df_corr = df[m].copy()   # These are fine, exact matches
df_inco = df[~m].copy()  # Need to fix

# Only if something needs to be fixed.
if not df_inco.empty:
    l = []
    for idx, row in df_inco.iterrows():
        z = zip_longest([row.state_name], matches, fillvalue=row.state_name)
        s = pd.DataFrame([(x[1], fuzz.ratio(x[0], x[1])) for x in z], 
                         columns=['match', 'fuzz_ratio'])
        s = s.loc[[s['fuzz_ratio'].idxmax()]]  # Best Match
        s.index = [idx] # Set index so we can just combine first after the loop.
        l.append(s)

    # Add the best match based on the fuzzy ratio
    df_inco = df_inco.combine_first(pd.concat(l)) 

    df = pd.concat([df_inco, df_corr], sort=False).sort_index()

Output

print(df)

   fuzz_ratio           match      state_name
0        80.0           Assam           Assan
1        93.0  Andhra Pradesh  Andhra Prade5h
2        86.0         Mizoram         M1zoram
3        92.0   Uttar Pradesh   Uttar Pr8desh
4         NaN           Assam           Assam

Upvotes: 1

Inputvector

Reputation: 1093

Here is a solution. Basically it compares your input with your correct name list and return best match(the most similar):

from difflib import SequenceMatcher

lst = ['Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'] #Correct Name List

def closest(s):
    highest = 0
    result = ''
    for i in lst:
        temp = SequenceMatcher(None, s, i).ratio() #Similarity Ratio
        if temp > highest:
            highest = temp
            result = i
    return result


df = pd.DataFrame(['Assan','Andhra Prade5h','M1zoram','Uttar Pr8desh'], columns = ["state_name"])

#Output Before
    state_name
0   Assan
1   Andhra Prade5h
2   M1zoram
3   Uttar Pr8desh


df['state_name'] = df['state_name'].apply(lambda x: closest(x))

#Output After
    state_name
0   Assam
1   Andhra Pradesh
2   Mizoram
3   Uttar Pradesh