Reputation: 53
I am calculating the total sum of 'price' for each 'id', but when there are duplicate rows for the same 'loc_id', the calculation should use only the highest price and ignore the lower prices for that 'loc_id'.
The example below has three rows for A-1. The highest price for A-1 is 100, so the total for A should be 100 + 200.
df
id loc_id price
A A-1 50
A A-1 100
A A-1 30
A A-2 200
B B-1 30
B B-2 50
df_expected_result
id total average
A 300 150
B 80 40
Without the max() call, the code below runs, but it double-counts the loc_id duplicates. I am trying to avoid that double counting. How can I make it select only the max value for each unique loc_id?
(df.set_index(['id','loc_id'])
.groupby(level=0)['price']
.max()
.agg({'total' : np.sum , 'average' : np.average })
.sort_values('total',ascending=False))
I get an error when I include max() method:
ValueError: cannot perform both aggregation and transformation operations simultaneously
Upvotes: 5
Views: 2187
Reputation: 828
You can do this in two steps: first take the maximum value per (id, loc_id) group, then compute the sum and average per id.
df = pd.DataFrame({'id': ['A','A','A','A','B','B'],
                   'loc_id': ['A-1','A-1','A-1','A-2','B-1','B-2'],
                   'price': [50,100,30,200,30,50]})
# highest price per (id, loc_id), then group that result by id
grouped = df.groupby(['id','loc_id']).max().reset_index().groupby('id')['price']
df_new = grouped.sum().reset_index()
df_new['average'] = grouped.mean().reset_index()['price']
df_new.rename(columns={'price': 'total'}, inplace=True)
df_new
id total average
0 A 300 150
1 B 80 40
Upvotes: 0
Reputation: 75100
Try with a double groupby():
(df.groupby(['id','loc_id'],as_index=False)['price'].max()
.groupby('id')['price'].agg(['sum','mean']))
sum mean
id
A 300 150
B 80 40
Upvotes: 4
Reputation: 323316
Here is one way, using sort_values + drop_duplicates:
df=df.sort_values(['price']).drop_duplicates(['id','loc_id'],keep='last')
df.groupby(['id']).price.agg(['mean','sum'])
mean sum
id
A 150 300
B 40 80
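A variant of the same idea that skips the sort step: idxmax returns the row label of the highest price within each (id, loc_id) group, so df.loc can pick those rows directly. A minimal sketch, assuming the sample df from the question:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B'],
                   'loc_id': ['A-1', 'A-1', 'A-1', 'A-2', 'B-1', 'B-2'],
                   'price': [50, 100, 30, 200, 30, 50]})

# row labels of the highest price within each (id, loc_id) group
top_rows = df.loc[df.groupby(['id', 'loc_id'])['price'].idxmax()]

summary = top_rows.groupby('id')['price'].agg(['mean', 'sum'])
print(summary)
```

Unlike sort_values + drop_duplicates(keep='last'), this does not depend on the frame being sorted by price first.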
Upvotes: 3