Reputation: 17
I have two dataframe, the first one is location ,
location = pd.DataFrame({'city': ['RIYADH','SEOUL','BUSAN','TOKYO','OSAKA'],
'country': ['Saudi Arabia','South Korea','South Korea','Japan','Japan']})
the other one is customer,
customer = pd.DataFrame({'id': [1001,2002,3003,4004,5005,6006,7007,8008,9009],
'city': ['tokio','Sorth KOREA','riadh','JAPANN','tokyo','osako','Arab Saudi','SEOUL','buSN']})
I want to change the typo word in location column in customer dataframe with the right one in city/country from location dataframe. So the output will be like this:
id location
1001 TOKYO
2002 South Korea
3003 RIYADH
4004 Japan
5005 TOKYO
6006 OSAKA
7007 Saudi Arabia
8008 SEOUL
9009 BUSAN
Upvotes: 1
Views: 98
Reputation: 25313
A possible solution, based on RapidFuzz
:
from rapidfuzz import process
out = (customer.assign(
aux = customer['city']
.map(lambda x:
process.extractOne(x, location['city']+'*'+location['country'])[0])))
out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
out['city'] = out.apply(lambda x:
process.extractOne(x['city'], x.loc['aux1':'aux2'])[0], axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])
Output:
id city
0 1001 TOKYO
1 2002 South Korea
2 3003 RIYADH
3 4004 Japan
4 5005 TOKYO
5 6006 OSAKA
6 7007 Saudi Arabia
7 8008 SEOUL
8 9009 BUSAN
EDIT
This tries to offer a solution for the OP's below comment:
from rapidfuzz import process
def get_match(x, y, score):
match = process.extractOne(x, y)
return np.nan if match[1] < score else match[0]
out = (customer.assign(
aux=customer['city']
.map(lambda x:
process.extractOne(x, location['city']+'*'+location['country'])[0])))
out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
out['city'] = out.apply(lambda x: get_match(
x['city'], x.loc['aux1':'aux2'], 92), axis=1)
out = out.drop(columns=['aux', 'aux1', 'aux2'])
Output:
id city
0 1001 NaN
1 2002 NaN
2 3003 NaN
3 4004 NaN
4 5005 TOKYO
5 6006 NaN
6 7007 NaN
7 8008 SEOUL
8 9009 NaN
Upvotes: 2