Gopal Chitalia
Gopal Chitalia

Reputation: 462

Merge two dataframe based on a common column which should be matched by partial string

I have two data frames df1 and df2 as shown below:

df1

Date        ID          Amount    BillNo1
10/08/2020  ABBCSQ1ZA   878       2020/156
10/08/2020  ABBCSQ1ZA   878       2020/157
10/12/2020  AC928Q1ZS   3998      343SONY
10/14/2020  AC9268RE3   198       432
10/16/2020  AA171E1Z0   5490      AIPO325
10/19/2020  BU073C1ZW   3432      IDBI436-Total
10/19/2020  BU073C1ZW   3432      IDBI437-Total

df2

Date        ID          Amount    BillNo2
10/08/2020  ABBCSQ1ZA   878       156
10/11/2020  ATRC95REW   115       265
10/14/2020  AC9268RE3   198       A/432
10/16/2020  AA171E1Z0   5490      325
10/19/2020  BU073C1ZW   3432      436
10/19/2020  BU073C1ZW   3432      437

My final answer should be:

Matched
Date        ID          Amount    BillNo1         BillNo2
10/08/2020  ABBCSQ1ZA   878       2020/156        156             # 156 matches
10/14/2020  AC9268RE3   198       432             A/432           # 432 matches
10/16/2020  AA171E1Z0   5490      AIPO325         325             # 325 matches
10/19/2020  BU073C1ZW   3432      IDBI436-Total   436             # 436 matches
10/19/2020  BU073C1ZW   3432      IDBI437-Total   437             # 437 matches 

Non Matched
Date        ID          Amount    BillNo1     BillNo2
10/08/2020  ABBCSQ1ZA   878       2020/157    NaN
10/12/2020  AC928Q1ZS   3998      343SONY     NaN
10/11/2020  ATRC95REW   115       NaN         265

How do I merge two dataframes based on partial string match of Column =['BillNo1','BillNo2']?

Upvotes: 1

Views: 72

Answers (1)

anky
anky

Reputation: 75080

You can define your own thresholds, but one proposal is below:

import difflib
from functools import partial
#the below function is inspired from https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x,y):
    """x=possibilities , y = input"""
    f = partial(
    difflib.get_close_matches, possibilities=x.unique(), n=1,cutoff=0.5)

    matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
    return pd.DataFrame([y,matches.rename('BillNo2')]).T

temp = get_closest_match(df2['BillNo2'],df1['BillNo1'])
temp['BillNo2'] = (temp['BillNo2']
                   .fillna(df1['BillNo1']
               .str.extract('('+'|'.join(df2['BillNo2'])+')',expand=False)))


merged = (df1.assign(BillNo2=df1['BillNo1'].map(dict(temp.values)))
.merge(df2.drop_duplicates(),on=['Date','ID','Amount','BillNo2']
,how='outer',indicator=True))

print(merged)

         Date         ID  Amount        BillNo1 BillNo2      _merge
0  10/08/2020  ABBCSQ1ZA     878       2020/156     156        both
1  10/08/2020  ABBCSQ1ZA     878       2020/157     NaN   left_only
2  10/12/2020  AC928Q1ZS    3998        343SONY     NaN   left_only
3  10/14/2020  AC9268RE3     198            432   A/432        both
4  10/16/2020  AA171E1Z0    5490        AIPO325     325        both
5  10/19/2020  BU073C1ZW    3432  IDBI436-Total     436        both
6  10/19/2020  BU073C1ZW    3432  IDBI437-Total     437        both
7  10/11/2020  ATRC95REW     115            NaN     265  right_only

Once you have above merged df, you can do;

matched = merged.query("_merge=='both'")
unmatched = merged.query("_merge!='both'")
print("Matched Df \n ", matched,'\n\n',"Unmatched Df \n " , unmatched)

Matched Df 
           Date         ID  Amount        BillNo1 BillNo2 _merge
0  10/08/2020  ABBCSQ1ZA     878       2020/156     156   both
3  10/14/2020  AC9268RE3     198            432   A/432   both
4  10/16/2020  AA171E1Z0    5490        AIPO325     325   both
5  10/19/2020  BU073C1ZW    3432  IDBI436-Total     436   both
6  10/19/2020  BU073C1ZW    3432  IDBI437-Total     437   both 

 Unmatched Df 
           Date         ID  Amount   BillNo1 BillNo2      _merge
1  10/08/2020  ABBCSQ1ZA     878  2020/157     NaN   left_only
2  10/12/2020  AC928Q1ZS    3998   343SONY     NaN   left_only
7  10/11/2020  ATRC95REW     115       NaN     265  right_only

Upvotes: 3

Related Questions