Milan
Milan

Reputation: 39

Pandas merging dataframe with multiple column and tolerance

I have 2 dataframes which i want to merge using 2 columns as Keys and create another merged dataframe. Here, Column1 is String and Columns2 is float datatype. I want to put tolerance of .01 on column2 such that (John, 1.12) from dataframe1 and (John, 1.13) from dataframe2 should come in same row. How can i achieve that ?

For example : 2 Dataframes :

df1 = pd.DataFrame({"Name":["John","Millon"], "MarketVal":[1.12,2.11], "Left_Product":["Sugar","Salt"]})
df2 = pd.DataFrame({"Name":["John","Rex"], "MarketVal":[1.13,3.11], "right_Product":["Sugar","Salt"]})

if I do :

dfMerge = pd.merge(df1,df2,on=["Name","MarketVal"], how= "outer")

then this will create 2 separate rows for "John" since MarketVal is different in both dataframes. But I want to keep tolerance of .01 on this so that they come at same row.

    Actual :->
     Name  MarketVal Left_Product right_Product
0    John       1.12        Sugar           NaN
1  Millon       2.11         Salt           NaN
2    John       1.13          NaN         Sugar
3     Rex       3.11          NaN          Salt

Expected :->

     Name  MarketVal Left_Product right_Product
0    John       1.12        Sugar           Sugar
1  Millon       2.11         Salt           NaN
2     Rex       3.11          NaN          Salt

Upvotes: 3

Views: 859

Answers (1)

ALollz
ALollz

Reputation: 59579

merge only on 'Name' and then use a mask to keep only the rows that are within your desired tolerance. Because of the outer join we also keep a row if any 'MarketVal' is null. We also need to update the 'MarketVal' if it came from the right DataFrame in the outer merge.

tol = 0.01
m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_r'])
m = m[(m['MarketVal'] - m['MarketVal_r']).abs().le(tol)
      | m[['MarketVal', 'MarketVal_r']].isnull().any(1)]

#     Name  MarketVal Left_Product  MarketVal_r right_Product
#0    John       1.12        Sugar         1.13         Sugar
#1  Millon       2.11         Salt          NaN           NaN
#2     Rex        NaN          NaN         3.11          Sal

m['MarketVal'] = m['MarketVal'].fillna(m['MarketVal_r'])
m = m.drop(columns='MarketVal_r')

#     Name  MarketVal Left_Product right_Product
#0    John       1.12        Sugar         Sugar
#1  Millon       2.11         Salt           NaN
#2     Rex       3.11          NaN          Salt

In the case of mulitple rows that match in df2 this will keep all of the combinations. Here I add another row for John with 'Coffee' that should match to the first row in df1, and a row that shouldn't match with anything.

df1 = pd.DataFrame({'Name': ['John', 'Millon'], 
                    'MarketVal': [1.12, 2.11], 'Left_Product':['Sugar', 'Salt']})
df2 = pd.DataFrame({'Name': ['John', 'Rex', 'John', 'John'], 
                    'MarketVal': [1.13, 3.11, 1.125, 17], 
                    'right_Product': ['Sugar', 'Salt', 'coffee', 'bad_item']})

#... the above code
print(m)

     Name  MarketVal Left_Product right_Product
0    John       1.12        Sugar         Sugar
1    John       1.12        Sugar        Coffee
3  Millon       2.11         Salt           NaN
4     Rex       3.11          NaN          Salt

Upvotes: 1

Related Questions