Reputation: 4100
Sample DF:
ID Name Price Sum
1 Apple 10 180
2 Apple 10 100
3 Apple 10 80
4 Orange 12 180
5 Orange 12 190
6 Banana 15 50
7 Banana 15 30
Expected DF:
ID Name Price Sum Result
1 Apple 10 180 Full Match
2 Orange 12 180 Match - High Confidence
3 Orange 12 190 Match - High Confidence
4 Banana 15 50 Match - Low Confidence
5 Banana 15 30 Match - Low Confidence
Problem Statement:
I need a Result column which works on the following conditions:

Within a group of Name and Price (i.e. Name is Apple and Price is 10): if the highest value is not within a +-30 range of the other rows, keep only the highest-value row with the Result column set to Full Match and delete the others (Sample DF - ID 1,2,3; expected DF keeps only ID 1).

Within a group of Name and Price: if the values are within a +-30 range of each other and also greater than 100, the Result column will be Match - High Confidence and no rows are deleted (Sample DF - ID 4,5; expected DF ID 2,3).

Within a group of Name and Price: if the values are within a +-30 range of each other and less than 60, the Result column will be Match - Low Confidence and no rows are deleted (Sample DF - ID 6,7; expected DF ID 4,5).

I am not able to find anything on how to solve this when there is a combination of groups involved plus a range condition. Any help?
Upvotes: 2
Views: 401
Reputation: 862641
I think you need:
import numpy as np
import pandas as pd

#get Series with the maximal Sum per Name/Price group
maxpergroup = df.groupby(['Name','Price'])['Sum'].transform('max')
#subtract from the group max, take absolute values and test if greater than 30
m1 = df['Sum'].sub(maxpergroup).abs().gt(30)
#get all groups where at least one True
m11 = m1.groupby([df['Name'],df['Price']]).transform('any')
#print (m11)
#compare against the thresholds and test if all in-range values per group match
m2 = df['Sum'].gt(100)
m22 = (m2 & ~m1).groupby([df['Name'],df['Price']]).transform('all')
#print (m22)
m3 = df['Sum'].lt(60)
m33 = (m3 & ~m1).groupby([df['Name'],df['Price']]).transform('all')
#print (m33)
#create new column - the order of vals must match the order of masks
masks = [m11, m22, m33]
vals = ['Full Match','Match - High Confidence','Match - Low Confidence']
df['result'] = np.select(masks, vals)
#remove unnecessary rows - keep only the group max where an outlier exists
df = df[~m11 | df['Sum'].eq(maxpergroup)]
print (df)
ID Name Price Sum result
0 1 Apple 10 180 Full Match
3 4 Orange 12 180 Match - High Confidence
4 5 Orange 12 190 Match - High Confidence
5 6 Banana 15 50 Match - Low Confidence
6 7 Banana 15 30 Match - Low Confidence
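For reference, here is the same approach as a self-contained sketch; the DataFrame construction from the question's sample data is my addition:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7],
                   'Name': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Banana', 'Banana'],
                   'Price': [10, 10, 10, 12, 12, 15, 15],
                   'Sum': [180, 100, 80, 180, 190, 50, 30]})

# max Sum per (Name, Price) group, aligned back to the original rows
maxpergroup = df.groupby(['Name', 'Price'])['Sum'].transform('max')

# rows more than 30 away from their group max, and groups containing any such row
m1 = df['Sum'].sub(maxpergroup).abs().gt(30)
m11 = m1.groupby([df['Name'], df['Price']]).transform('any')

# groups where every in-range row is above 100 / below 60
m22 = (df['Sum'].gt(100) & ~m1).groupby([df['Name'], df['Price']]).transform('all')
m33 = (df['Sum'].lt(60) & ~m1).groupby([df['Name'], df['Price']]).transform('all')

# label rows, then keep only the group max where an outlier group was flagged
df['result'] = np.select([m11, m22, m33],
                         ['Full Match', 'Match - High Confidence', 'Match - Low Confidence'])
df = df[~m11 | df['Sum'].eq(maxpergroup)]
```

Note the masks are evaluated in order by np.select, so a Full Match group is never also tested against the confidence thresholds.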
Upvotes: 1
Reputation: 75080
Here is what I came up with, you can give it a try:
import numpy as np
import pandas as pd

#get absolute difference from the group max
df['diff_abs']=abs(df.Sum-df.groupby(['Name','Price'])['Sum'].transform('max'))
#keep only duplicated rows whose difference from the group max is less than 30
m=df.loc[df.duplicated(['Name','Price'],keep=False)&df.diff_abs.lt(30)].reset_index()
print(m)
index ID Name Price Sum diff_abs
0 0 1 Apple 10 180 0
1 3 4 Orange 12 180 10
2 4 5 Orange 12 190 0
3 5 6 Banana 15 50 0
4 6 7 Banana 15 30 20
c1=~m.duplicated(['Name','Price'],keep=False) #group reduced to a single row -> Full Match
c2=m.duplicated(['Name','Price'],keep=False)&m.Sum.lt(60) #if dups remain, check for less than 60 -> Low Confidence
m['result']=np.select([c1,c2],
                      ['Full Match','Match - Low Confidence'],'Match - High Confidence')
print(m)
index ID Name Price Sum diff_abs result
0 0 1 Apple 10 180 0 Full Match
1 3 4 Orange 12 180 10 Match - High Confidence
2 4 5 Orange 12 190 0 Match - High Confidence
3 5 6 Banana 15 50 0 Match - Low Confidence
4 6 7 Banana 15 30 20 Match - Low Confidence
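Put together as a runnable sketch (the sample data is reconstructed from the question, and dropping the helper column at the end is my addition):

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7],
                   'Name': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Banana', 'Banana'],
                   'Price': [10, 10, 10, 12, 12, 15, 15],
                   'Sum': [180, 100, 80, 180, 190, 50, 30]})

# absolute distance of each Sum from its (Name, Price) group maximum
df['diff_abs'] = (df['Sum'] - df.groupby(['Name', 'Price'])['Sum'].transform('max')).abs()

# keep only duplicated (Name, Price) rows that sit within 30 of the group max
m = df.loc[df.duplicated(['Name', 'Price'], keep=False) & df['diff_abs'].lt(30)].reset_index(drop=True)

# groups reduced to a single survivor -> Full Match;
# surviving duplicates with Sum below 60 -> Low Confidence; everything else -> High Confidence
c1 = ~m.duplicated(['Name', 'Price'], keep=False)
c2 = m.duplicated(['Name', 'Price'], keep=False) & m['Sum'].lt(60)
m['result'] = np.select([c1, c2], ['Full Match', 'Match - Low Confidence'],
                        'Match - High Confidence')
m = m.drop(columns='diff_abs')
```

One caveat of this approach: duplicated(keep=False) is False for groups that start with a single row, so such rows would be filtered out entirely; the sample data always has at least two rows per group.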
Upvotes: 1