Reputation: 39
I have 2 dataframes which i want to merge using 2 columns as Keys and create another merged dataframe. Here, Column1 is String and Columns2 is float datatype. I want to put tolerance of .01 on column2 such that (John, 1.12) from dataframe1 and (John, 1.13) from dataframe2 should come in same row. How can i achieve that ?
For example : 2 Dataframes :
df1 = pd.DataFrame({"Name":["John","Millon"], "MarketVal":[1.12,2.11], "Left_Product":["Sugar","Salt"]})
df2 = pd.DataFrame({"Name":["John","Rex"], "MarketVal":[1.13,3.11], "right_Product":["Sugar","Salt"]})
if I do :
dfMerge = pd.merge(df1,df2,on=["Name","MarketVal"], how= "outer")
then this will create 2 separate rows for "John" since MarketVal is different in both dataframes. But I want to keep tolerance of .01 on this so that they come at same row.
Actual :->
Name MarketVal Left_Product right_Product
0 John 1.12 Sugar NaN
1 Millon 2.11 Salt NaN
2 John 1.13 NaN Sugar
3 Rex 3.11 NaN Salt
Expected :->
Name MarketVal Left_Product right_Product
0 John 1.12 Sugar Sugar
1 Millon 2.11 Salt NaN
2 Rex 3.11 NaN Salt
Upvotes: 3
Views: 859
Reputation: 59579
merge
only on 'Name'
and then use a mask to keep only the rows that are within your desired tolerance. Because of the outer join we also keep a row if any 'MarketVal' is null. We also need to update the 'MarketVal'
if it came from the right DataFrame in the outer merge.
tol = 0.01
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_r'])
m = m[(m['MarketVal'] - m['MarketVal_r']).abs().le(tol)
| m[['MarketVal', 'MarketVal_r']].isnull().any(1)]
# Name MarketVal Left_Product MarketVal_r right_Product
#0 John 1.12 Sugar 1.13 Sugar
#1 Millon 2.11 Salt NaN NaN
#2 Rex NaN NaN 3.11 Sal
m['MarketVal'] = m['MarketVal'].fillna(m['MarketVal_r'])
m = m.drop(columns='MarketVal_r')
# Name MarketVal Left_Product right_Product
#0 John 1.12 Sugar Sugar
#1 Millon 2.11 Salt NaN
#2 Rex 3.11 NaN Salt
In the case of mulitple rows that match in df2
this will keep all of the combinations. Here I add another row for John with 'Coffee' that should match to the first row in df1, and a row that shouldn't match with anything.
df1 = pd.DataFrame({'Name': ['John', 'Millon'],
'MarketVal': [1.12, 2.11], 'Left_Product':['Sugar', 'Salt']})
df2 = pd.DataFrame({'Name': ['John', 'Rex', 'John', 'John'],
'MarketVal': [1.13, 3.11, 1.125, 17],
'right_Product': ['Sugar', 'Salt', 'coffee', 'bad_item']})
#... the above code
print(m)
Name MarketVal Left_Product right_Product
0 John 1.12 Sugar Sugar
1 John 1.12 Sugar Coffee
3 Millon 2.11 Salt NaN
4 Rex 3.11 NaN Salt
Upvotes: 1