Reputation: 623
I have two different customer dataframes, and I would like to match their rows using a Jaccard distance matrix or any other string-similarity method.
df1
    Name     country  cost
0    raj  Kazakhstan    23
1    sam      Russia   243
2  kanan     Belarus     2
3    Nan         Nan     0
df2
   Name     country         DOB
0   rak  Kazakhstan  12-12-1903
1   sim      russia  03-04-1994
2   raj     Belarus  21-09-2003
3  kane     Belarus  23-12-1999
Output:
If the string similarity between two rows is greater than 0.6, I would like to combine both rows into one row of a new dataframe.
Df3
    Name     country  Name     country  cost         DOB
0    raj  Kazakhstan   rak  Kazakhstan    23  12-12-1903
1    sam      Russia   sim      russia   243  03-04-1994
2  kanan     Belarus  Kane     Belarus     2  23-12-1999
I tried comparing the dataframes row by row, but I don't know how to compare each row of one dataframe against every row of the other.
Upvotes: 2
Views: 1801
Reputation: 323226
I would suggest using fuzzywuzzy:
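from fuzzywuzzy import process

# Build a text key per row; df.sum(1) is unreliable when text and numeric columns are mixed
df1['key'] = df1['Name'] + df1['country']
df2['key'] = df2['Name'] + df2['country']
choices = df2['key'].tolist()

def yoursource(x):
    # process.extract returns a list of (match, score) tuples; scores range from 0 to 100
    match, score = process.extract(x, choices, limit=1)[0]
    return match if score > 60 else 'notmatch'

df1['key'] = df1['key'].apply(yoursource)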
After that, each row of df1 carries the key of its best match in df2 (or 'notmatch'), so we can join the two dataframes with merge:
df = df1.merge(df2, on='key', how='inner').drop(columns='key')
df
  Name_x   country_x Name_y   country_y
0    raj  Kazakhstan    rak  Kazakhstan
1    sam      Russia    sim      russia
2  kanan     Belarus   kane     Belarus
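Since the question mentions Jaccard specifically, here is a rough sketch of the same idea without fuzzywuzzy. It is only one possible interpretation: it assumes the similarity is computed over lowercase character bigrams of the concatenated Name and country, and the helper names (bigrams, jaccard, best_match) are made up for this example. The 0.6 threshold comes from the question.

```python
def bigrams(text):
    """Return the set of lowercase character bigrams in text."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| over character bigrams."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0  # nothing to compare (strings shorter than 2 characters)
    return len(x & y) / len(x | y)


def best_match(key, candidates, threshold=0.6):
    """Return (candidate, score) for the most similar candidate,
    or None if the best score does not exceed the threshold."""
    scored = [(c, jaccard(key, c)) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] > threshold else None


# Keys built the same way as above: Name + country for each row
left = ["rajKazakhstan", "samRussia", "kananBelarus"]
right = ["rakKazakhstan", "simrussia", "rajBelarus", "kaneBelarus"]

# Compare every key on the left against every key on the right
matches = {k: best_match(k, right) for k in left}
for key, result in matches.items():
    print(key, "->", result)
```

With a dataframe, the same function could be applied to the key column, for example df1['key'].apply(lambda k: best_match(k, df2['key'].tolist())), and the matched key then used in the merge as before. Note that this compares every row with every row, so it gets slow for large dataframes.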
Upvotes: 6