Reputation: 497
I have 2 dataframes df1
containing id_number
and df2
containing identity_No
.
Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. Want to compare each row(df1['id_number']) from df1 with the whole column(df2['identity_No']) of df2. I have tried using partial match as well but not getting the output.
df1
score id_number company_name company_code match_acc action_reqd
20 IN2231D AXN pvt Ltd IN225 Yes
45 UK654IN Aviva Intl Ltd IN115 No
65 SL1432H Ship Incorporations CZ555 Yes
35 LK0678G Oppo Mobiles pvt ltd PQ795 Yes
59 NG5678J Nokia Inc RS885 No
20 IN2231D AXN pvt Ltd IN215 Yes
df2
OR_score identity_No comp_name comp_code
51 UK654IN Aviva Int.L Ltd IN515
25 SL6752J Ship Inc Traders CZ555
79 NG5678K Nokia Inc RS005
20 IN22312 AXN pvt Ltd IN255
38 LK0665G Oppo Mobiles ltd PQ895
For Eg: The df1.id_number
need to be compared with df2.identity_No
, Looking to match based on row1 of df1['id_number']
will match across all rows of df2['identity_No']
, and has highest match percentage wrt. row4 of df2['identity_No']
, and its more than 80%, it will copy the respective values from row4 of df2 to row1 of df1. same to be applied for each row of df1.
Expected Output:
score id_number company_name company_code match_acc action_reqd
20 IN22312 AXN pvt Ltd IN225 90 Yes
51 UK654IN Aviva Int.L Ltd IN115 100 No
25 SL1432H Ship Incorporations CZ555 30 Yes
38 LK0665G Oppo Mobiles ltd PQ795 80 Yes
79 NG5678K Nokia Inc RS885 85 No
20 IN22312 AXN pvt Ltd IN225 90 Yes
I have tried this now:
for index, row in df1.iterrows():
for index2, config2 in df2.iterrows():
if process.extractOne(row["id_number"], df["identity_No"])[1] >=80:
df1['id_number'][index] = config2['identity_No']
df1['company_name'][index] = config2['comp_name']
df1['company_code'][index] = config2['comp_code']
df1['score'][index] = config2['OR_Score']
Attempt2
for index, row in df1.iterrows():
for index2, config2 in df2.iterrows():
if fuzz.partial_ratio(row["id_number"], config2["identity_No"]) >=80:
Please Suggest
Upvotes: 1
Views: 826
Reputation: 120479
Starting from my previous answer.
Comments embedded:
import pandas as pd
from fuzzywuzzy import process
# Column mapping between the 2 dataframes
cols1 = ["score", "id_number", "company_name", "company_code"]
cols2 = ["OR_score", "identity_No", "comp_name", "comp_code"]
# Find the single best match above a score in a list of choices.
dfm = pd.DataFrame(df1["id_number"].apply(lambda x: process.extractOne(x, df2["identity_No"]))
.tolist(), columns=["match_comp", "match_acc", "match_idx"])
# Get the indexes of (df1, df2) which satisfy the condition (match_acc> 80)
idx1, idx2 = dfm.loc[dfm["match_acc"] > 80, "match_idx"].reset_index().values.T.tolist()
# Update values from df2 to df1
df1.loc[idx1, cols1] = df2.loc[idx2, cols2].values
df1["match_acc"] = dfm["match_acc"] # don't forget match_acc column
>>> df1
score id_number company_name company_code match_acc action_reqd
0 20 IN22312 AXN pvt Ltd IN255 86 Yes
1 51 UK654IN Aviva Int.L Ltd IN515 100 No
2 65 SL1432H Ship Incorporations CZ555 43 Yes
3 35 LK0678G Oppo Mobiles pvt ltd PQ795 71 Yes
4 79 NG5678K Nokia Inc RS005 86 No
5 20 IN22312 AXN pvt Ltd IN255 86 Yes
Tested on your input data:
df1 = pd.read_csv(io.StringIO("""score,id_number,company_name,company_code,match_acc,action_reqd
20,IN2231D,AXN pvt Ltd,IN225,,Yes
45,UK654IN,Aviva Intl Ltd,IN115,,No
65,SL1432H,Ship Incorporations,CZ555,,Yes
35,LK0678G,Oppo Mobiles pvt ltd,PQ795,,Yes
59,NG5678J,Nokia Inc,RS885,,No
20,IN2231D,AXN pvt Ltd,IN215,,Yes"""))
df2 = pd.read_csv(io.StringIO("""OR_score,identity_No,comp_name,comp_code
51,UK654IN,Aviva Int.L Ltd,IN515
25,SL6752J,Ship Inc Traders,CZ555
79,NG5678K,Nokia Inc,RS005
20,IN22312,AXN pvt Ltd,IN255
38,LK0665G,Oppo Mobiles ltd,PQ895"""))
Upvotes: 1