Arthur
Arthur

Reputation: 17

Change Typo Column Values with Right Word based on Columns in Other Dataframe

I have two dataframe, the first one is location ,

location = pd.DataFrame({'city': ['RIYADH','SEOUL','BUSAN','TOKYO','OSAKA'],
                    'country': ['Saudi Arabia','South Korea','South Korea','Japan','Japan']})

the other one is customer,

customer = pd.DataFrame({'id': [1001,2002,3003,4004,5005,6006,7007,8008,9009],
                        'city': ['tokio','Sorth KOREA','riadh','JAPANN','tokyo','osako','Arab Saudi','SEOUL','buSN']})

I want to change the typo word in location column in customer dataframe with the right one in city/country from location dataframe. So the output will be like this:

id    location
1001  TOKYO
2002  South Korea
3003  RIYADH
4004  Japan
5005  TOKYO
6006  OSAKA
7007  Saudi Arabia
8008  SEOUL
9009  BUSAN

Upvotes: 1

Views: 98

Answers (1)

PaulS
PaulS

Reputation: 25313

A possible solution, based on RapidFuzz:

from rapidfuzz import process

out = (customer.assign(
    aux = customer['city']
    .map(lambda x: 
        process.extractOne(x, location['city']+'*'+location['country'])[0])))

out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
out['city'] = out.apply(lambda x: 
    process.extractOne(x['city'], x.loc['aux1':'aux2'])[0], axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])

Output:

     id          city
0  1001         TOKYO
1  2002   South Korea
2  3003        RIYADH
3  4004         Japan
4  5005         TOKYO
5  6006         OSAKA
6  7007  Saudi Arabia
7  8008         SEOUL
8  9009         BUSAN

EDIT

This tries to offer a solution for the OP's below comment:

from rapidfuzz import process

def get_match(x, y, score):
    match = process.extractOne(x, y)
    return np.nan if match[1] < score else match[0]

out = (customer.assign(
    aux=customer['city']
    .map(lambda x:
         process.extractOne(x, location['city']+'*'+location['country'])[0])))

out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
out['city'] = out.apply(lambda x: get_match(
    x['city'], x.loc['aux1':'aux2'], 92), axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])

Output:

     id   city
0  1001    NaN
1  2002    NaN
2  3003    NaN
3  4004    NaN
4  5005  TOKYO
5  6006    NaN
6  7007    NaN
7  8008  SEOUL
8  9009    NaN

Upvotes: 2

Related Questions