NKJ
NKJ

Reputation: 497

Map values by comparing 2 columns from different dataframes using partial match in python

I have 2 dataframes df1 containing id_number and df2 containing identity_No.

Looking to map highest matching row values from Dataframe2 to Dataframe1 using conditions. Want to compare each row(df1['id_number']) from df1 with the whole column(df2['identity_No']) of df2. I have tried using partial match as well but not getting the output.

df1

score   id_number       company_name      company_code     match_acc     action_reqd
20      IN2231D           AXN pvt Ltd        IN225                          Yes
45      UK654IN        Aviva Intl Ltd        IN115                          No
65      SL1432H   Ship Incorporations        CZ555                          Yes
35      LK0678G  Oppo Mobiles pvt ltd        PQ795                          Yes
59      NG5678J             Nokia Inc        RS885                          No
20      IN2231D           AXN pvt Ltd        IN215                          Yes

df2

OR_score   identity_No       comp_name        comp_code   
51          UK654IN        Aviva Int.L Ltd       IN515  
25          SL6752J       Ship Inc Traders       CZ555  
79          NG5678K             Nokia Inc        RS005 
20          IN22312           AXN pvt Ltd        IN255
38          LK0665G       Oppo Mobiles ltd       PQ895 

For Eg: The df1.id_number need to be compared with df2.identity_No, Looking to match based on row1 of df1['id_number'] will match across all rows of df2['identity_No'], and has highest match percentage wrt. row4 of df2['identity_No'] , and its more than 80%, it will copy the respective values from row4 of df2 to row1 of df1. same to be applied for each row of df1.

Expected Output:

score   id_number       company_name      company_code     match_acc     action_reqd
20      IN22312           AXN pvt Ltd        IN225              90          Yes
51      UK654IN       Aviva Int.L Ltd        IN115              100         No
25      SL1432H   Ship Incorporations        CZ555              30          Yes
38      LK0665G      Oppo Mobiles ltd        PQ795              80          Yes
79      NG5678K             Nokia Inc        RS885              85          No
20      IN22312           AXN pvt Ltd        IN225              90          Yes

I have tried this now:

for index, row in df1.iterrows():
    for index2, config2 in df2.iterrows():
        if process.extractOne(row["id_number"], df["identity_No"])[1] >=80:
            df1['id_number'][index] = config2['identity_No']
            df1['company_name'][index] = config2['comp_name']
            df1['company_code'][index] = config2['comp_code']
            df1['score'][index] = config2['OR_Score']

Attempt2

for index, row in df1.iterrows():
        for index2, config2 in df2.iterrows():
            if fuzz.partial_ratio(row["id_number"], config2["identity_No"]) >=80:

Please Suggest

Upvotes: 1

Views: 826

Answers (1)

Corralien
Corralien

Reputation: 120479

Starting from my previous answer.

Comments embedded:

import pandas as pd
from fuzzywuzzy import process

# Column mapping between the 2 dataframes
cols1 = ["score", "id_number", "company_name", "company_code"]
cols2 = ["OR_score", "identity_No", "comp_name", "comp_code"]

# Find the single best match above a score in a list of choices.
dfm = pd.DataFrame(df1["id_number"].apply(lambda x: process.extractOne(x, df2["identity_No"]))
                                   .tolist(), columns=["match_comp", "match_acc", "match_idx"])

# Get the indexes of (df1, df2) which satisfy the condition (match_acc> 80)
idx1, idx2 = dfm.loc[dfm["match_acc"] > 80, "match_idx"].reset_index().values.T.tolist()

# Update values from df2 to df1
df1.loc[idx1, cols1] = df2.loc[idx2, cols2].values
df1["match_acc"] = dfm["match_acc"]  # don't forget match_acc column
>>> df1
   score id_number          company_name company_code  match_acc action_reqd
0     20   IN22312           AXN pvt Ltd        IN255         86         Yes
1     51   UK654IN       Aviva Int.L Ltd        IN515        100          No
2     65   SL1432H   Ship Incorporations        CZ555         43         Yes
3     35   LK0678G  Oppo Mobiles pvt ltd        PQ795         71         Yes
4     79   NG5678K             Nokia Inc        RS005         86          No
5     20   IN22312           AXN pvt Ltd        IN255         86         Yes

Tested on your input data:

df1 = pd.read_csv(io.StringIO("""score,id_number,company_name,company_code,match_acc,action_reqd
20,IN2231D,AXN pvt Ltd,IN225,,Yes
45,UK654IN,Aviva Intl Ltd,IN115,,No
65,SL1432H,Ship Incorporations,CZ555,,Yes
35,LK0678G,Oppo Mobiles pvt ltd,PQ795,,Yes
59,NG5678J,Nokia Inc,RS885,,No
20,IN2231D,AXN pvt Ltd,IN215,,Yes"""))

df2 = pd.read_csv(io.StringIO("""OR_score,identity_No,comp_name,comp_code
51,UK654IN,Aviva Int.L Ltd,IN515
25,SL6752J,Ship Inc Traders,CZ555
79,NG5678K,Nokia Inc,RS005
20,IN22312,AXN pvt Ltd,IN255
38,LK0665G,Oppo Mobiles ltd,PQ895"""))

Upvotes: 1

Related Questions