Reputation: 4100
Sample DF:
ID Name Price Sum
1 Apple 10 180
2 Apple 10 100
3 Apple 10 80
4 Orange 12 180
5 Orange 12 190
6 Banana 15 50
7 Banana 15 30
Expected DF:
ID Name Price Sum Result
1 Apple 10 180 Full Match
2 Orange 12 180 Match - High Confidence
3 Orange 12 190 Match - High Confidence
4 Banana 15 50 Match - Low Confidence
5 Banana 15 30 Match - Low Confidence
Problem Statement:
I need a Result column which works on the following conditions:

Within a group of Name and Price (i.e. Name is Apple and Price is 10): if the highest value is not within a +-30 range of the other rows, keep only the highest-value row with the Result column set to Full Match and delete the others (Sample DF - ID 1,2,3; expected DF keeps only ID 1).

Within a group of Name and Price: if the values are within a +-30 range of each other and also greater than 100, the Result column will be Match - High Confidence and no rows are deleted (Sample DF - ID 4,5; expected DF ID 2,3).

Within a group of Name and Price: if the values are within a +-30 range of each other and less than 60, the Result column will be Match - Low Confidence and no rows are deleted (Sample DF - ID 6,7; expected DF ID 4,5).

I am not able to find anything on how to solve this when there is a combination of groups involved plus a range condition. Any help?
Upvotes: 2
Views: 401
Reputation: 862641
I think you need:
import numpy as np
import pandas as pd

#get Series with the maximal Sum per Name/Price group
maxpergroup = df.groupby(['Name','Price'])['Sum'].transform('max')
#subtract from the group max, take absolute values and test if greater than 30
m1 = df['Sum'].sub(maxpergroup).abs().gt(30)
#get all groups where at least one True
m11 = m1.groupby([df['Name'],df['Price']]).transform('any')
#print (m11)
#compare against the thresholds and test if all in-range values per group match
m2 = df['Sum'].gt(100)
m22 = (m2 & ~m1).groupby([df['Name'],df['Price']]).transform('all')
#print (m22)
m3 = df['Sum'].lt(60)
m33 = (m3 & ~m1).groupby([df['Name'],df['Price']]).transform('all')
#print (m33)
#create new column - the order of vals must match the order of masks
masks = [m11, m22, m33]
vals = ['Full Match','Match - High Confidence','Match - Low Confidence']
df['result'] = np.select(masks, vals)
#remove unnecessary rows - keep only the group max where an outlier exists
df = df[~m11 | df['Sum'].eq(maxpergroup)]
print (df)
ID Name Price Sum result
0 1 Apple 10 180 Full Match
3 4 Orange 12 180 Match - High Confidence
4 5 Orange 12 190 Match - High Confidence
5 6 Banana 15 50 Match - Low Confidence
6 7 Banana 15 30 Match - Low Confidence
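For reference, here is the same approach as a self-contained sketch; the DataFrame construction from the question's sample data is my addition:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7],
                   'Name': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Banana', 'Banana'],
                   'Price': [10, 10, 10, 12, 12, 15, 15],
                   'Sum': [180, 100, 80, 180, 190, 50, 30]})

# max Sum per (Name, Price) group, aligned back to the original rows
maxpergroup = df.groupby(['Name', 'Price'])['Sum'].transform('max')

# rows more than 30 away from their group max, and groups containing any such row
m1 = df['Sum'].sub(maxpergroup).abs().gt(30)
m11 = m1.groupby([df['Name'], df['Price']]).transform('any')

# groups where every in-range row is above 100 / below 60
m22 = (df['Sum'].gt(100) & ~m1).groupby([df['Name'], df['Price']]).transform('all')
m33 = (df['Sum'].lt(60) & ~m1).groupby([df['Name'], df['Price']]).transform('all')

# label rows, then keep only the group max where an outlier group was flagged
df['result'] = np.select([m11, m22, m33],
                         ['Full Match', 'Match - High Confidence', 'Match - Low Confidence'])
df = df[~m11 | df['Sum'].eq(maxpergroup)]
```

Note the masks are evaluated in order by np.select, so a Full Match group is never also tested against the confidence thresholds.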
Upvotes: 1
Reputation: 75080
Here is what I came up with, you can give it a try:
import numpy as np
import pandas as pd

#get absolute difference from the group max
df['diff_abs']=abs(df.Sum-df.groupby(['Name','Price'])['Sum'].transform('max'))
#keep only duplicated rows whose difference from the group max is less than 30
m=df.loc[df.duplicated(['Name','Price'],keep=False)&df.diff_abs.lt(30)].reset_index()
print(m)
index ID Name Price Sum diff_abs
0 0 1 Apple 10 180 0
1 3 4 Orange 12 180 10
2 4 5 Orange 12 190 0
3 5 6 Banana 15 50 0
4 6 7 Banana 15 30 20
c1=~m.duplicated(['Name','Price'],keep=False) #group reduced to a single row -> Full Match
c2=m.duplicated(['Name','Price'],keep=False)&m.Sum.lt(60) #if dups remain, check for less than 60 -> Low Confidence
m['result']=np.select([c1,c2],
                      ['Full Match','Match - Low Confidence'],'Match - High Confidence')
print(m)
index ID Name Price Sum diff_abs result
0 0 1 Apple 10 180 0 Full Match
1 3 4 Orange 12 180 10 Match - High Confidence
2 4 5 Orange 12 190 0 Match - High Confidence
3 5 6 Banana 15 50 0 Match - Low Confidence
4 6 7 Banana 15 30 20 Match - Low Confidence
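Put together as a runnable sketch (the sample data is reconstructed from the question, and dropping the helper column at the end is my addition):

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7],
                   'Name': ['Apple', 'Apple', 'Apple', 'Orange', 'Orange', 'Banana', 'Banana'],
                   'Price': [10, 10, 10, 12, 12, 15, 15],
                   'Sum': [180, 100, 80, 180, 190, 50, 30]})

# absolute distance of each Sum from its (Name, Price) group maximum
df['diff_abs'] = (df['Sum'] - df.groupby(['Name', 'Price'])['Sum'].transform('max')).abs()

# keep only duplicated (Name, Price) rows that sit within 30 of the group max
m = df.loc[df.duplicated(['Name', 'Price'], keep=False) & df['diff_abs'].lt(30)].reset_index(drop=True)

# groups reduced to a single survivor -> Full Match;
# surviving duplicates with Sum below 60 -> Low Confidence; everything else -> High Confidence
c1 = ~m.duplicated(['Name', 'Price'], keep=False)
c2 = m.duplicated(['Name', 'Price'], keep=False) & m['Sum'].lt(60)
m['result'] = np.select([c1, c2], ['Full Match', 'Match - Low Confidence'],
                        'Match - High Confidence')
m = m.drop(columns='diff_abs')
```

One caveat of this approach: duplicated(keep=False) is False for groups that start with a single row, so such rows would be filtered out entirely; the sample data always has at least two rows per group.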
Upvotes: 1