Reputation: 462
I have two data frames df1
and df2
as shown below:
df1
Date ID Amount BillNo1
10/08/2020 ABBCSQ1ZA 878 2020/156
10/08/2020 ABBCSQ1ZA 878 2020/157
10/12/2020 AC928Q1ZS 3998 343SONY
10/14/2020 AC9268RE3 198 432
10/16/2020 AA171E1Z0 5490 AIPO325
10/19/2020 BU073C1ZW 3432 IDBI436-Total
10/19/2020 BU073C1ZW 3432 IDBI437-Total
df2
Date ID Amount BillNo2
10/08/2020 ABBCSQ1ZA 878 156
10/11/2020 ATRC95REW 115 265
10/14/2020 AC9268RE3 198 A/432
10/16/2020 AA171E1Z0 5490 325
10/19/2020 BU073C1ZW 3432 436
10/19/2020 BU073C1ZW 3432 437
My final answer should be:
Matched
Date ID Amount BillNo1 BillNo2
10/08/2020 ABBCSQ1ZA 878 2020/156 156 # 156 matches
10/14/2020 AC9268RE3 198 432 A/432 # 432 matches
10/16/2020 AA171E1Z0 5490 AIPO325 325 # 325 matches
10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 # 436 matches
10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 # 437 matches
Non Matched
Date ID Amount BillNo1 BillNo2
10/08/2020 ABBCSQ1ZA 878 2020/157 NaN
10/12/2020 AC928Q1ZS 3998 343SONY NaN
10/11/2020 ATRC95REW 115 NaN 265
How do I merge two dataframes based on partial string match of Column =['BillNo1','BillNo2']
?
Upvotes: 1
Views: 72
Reputation: 75080
You can define your own thresholds, but one proposal is below:
import difflib
from functools import partial
#the below function is inspired from https://stackoverflow.com/a/56521804/9840637
def get_closest_match(x,y):
"""x=possibilities , y = input"""
f = partial(
difflib.get_close_matches, possibilities=x.unique(), n=1,cutoff=0.5)
matches = y.astype(str).drop_duplicates().map(f).fillna('').str[0]
return pd.DataFrame([y,matches.rename('BillNo2')]).T
temp = get_closest_match(df2['BillNo2'],df1['BillNo1'])
temp['BillNo2'] = (temp['BillNo2']
.fillna(df1['BillNo1']
.str.extract('('+'|'.join(df2['BillNo2'])+')',expand=False)))
merged = (df1.assign(BillNo2=df1['BillNo1'].map(dict(temp.values)))
.merge(df2.drop_duplicates(),on=['Date','ID','Amount','BillNo2']
,how='outer',indicator=True))
print(merged)
Date ID Amount BillNo1 BillNo2 _merge
0 10/08/2020 ABBCSQ1ZA 878 2020/156 156 both
1 10/08/2020 ABBCSQ1ZA 878 2020/157 NaN left_only
2 10/12/2020 AC928Q1ZS 3998 343SONY NaN left_only
3 10/14/2020 AC9268RE3 198 432 A/432 both
4 10/16/2020 AA171E1Z0 5490 AIPO325 325 both
5 10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 both
6 10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 both
7 10/11/2020 ATRC95REW 115 NaN 265 right_only
Once you have above merged df, you can do;
matched = merged.query("_merge=='both'")
unmatched = merged.query("_merge!='both'")
print("Matched Df \n ", matched,'\n\n',"Unmatched Df \n " , unmatched)
Matched Df
Date ID Amount BillNo1 BillNo2 _merge
0 10/08/2020 ABBCSQ1ZA 878 2020/156 156 both
3 10/14/2020 AC9268RE3 198 432 A/432 both
4 10/16/2020 AA171E1Z0 5490 AIPO325 325 both
5 10/19/2020 BU073C1ZW 3432 IDBI436-Total 436 both
6 10/19/2020 BU073C1ZW 3432 IDBI437-Total 437 both
Unmatched Df
Date ID Amount BillNo1 BillNo2 _merge
1 10/08/2020 ABBCSQ1ZA 878 2020/157 NaN left_only
2 10/12/2020 AC928Q1ZS 3998 343SONY NaN left_only
7 10/11/2020 ATRC95REW 115 NaN 265 right_only
Upvotes: 3