Reputation: 63
So I used a group by on a pandas dataframe which looks like this
df.groupby(['year','month'])['AMT'].agg('sum')
And I get something like this
year month
2003 1 114.00
2 9195.00
3 300.00
5 200.00
6 450.00
7 68.00
8 750.00
9 3521.00
10 250.00
11 799.00
12 1000.00
2004 1 8551.00
2 9998.00
3 17334.00
4 2525.00
5 16014.00
6 9132.00
7 10623.00
8 7538.00
9 3650.00
10 7733.00
11 10128.00
12 4741.00
2005 1 6965.00
2 3208.00
3 8630.00
4 7776.00
5 11950.00
6 11717.00
7 1510.00
...
2015 7 1431441.00
8 966974.00
9 1121650.00
10 1200104.00
11 1312191.90
12 482535.00
2016 1 1337343.00
2 1465068.00
3 1170113.00
4 1121691.00
5 1302936.00
6 1518047.00
7 1251844.00
8 825215.00
9 1491626.00
10 1243877.00
11 1632252.00
12 750995.50
2017 1 905974.00
2 1330182.00
3 1382628.52
4 1146789.00
5 1201425.00
6 1278701.00
7 1172596.00
8 1517116.50
9 1108609.00
10 1360841.00
11 1340386.00
12 860686.00
What I want is to just select the max out of the third summed column so that the final data frame has only the max from each year, something like:
year month
2003 2 9195.00
2004 3 17334.00
2005 5 11950.00
... and so on
What do I have to add to my group by aggregation to do this?
Upvotes: 5
Views: 3540
Reputation: 862731
I think need DataFrameGroupBy.idxmax
:
s = df.groupby(['year','month'])['AMT'].sum()
out = s.loc[s.groupby(level=0).idxmax()]
#working in newer pandas versions
#out = df.loc[df.groupby('Year').idxmax()]
print (out)
Year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
If possible multiple max values per years:
out = s[s == s.groupby(level=0).transform('max')]
print (out)
Year month
2003 2 9195.0
2004 3 17334.0
2005 5 11950.0
Name: AMT, dtype: float64
Upvotes: 5
Reputation: 164693
You can use GroupBy
+ transform
with max
. Note this gives multiple maximums for any years where a tie exists. This may or may not be what you require.
As you have requested, it's possible to do this in 2 steps, first summing and then calculating maximums by year.
df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
'month': [1, 2, 2, 1, 1, 2],
'AMT': [100, 200, 100, 100, 300, 100]})
# STEP 1: sum by year + month
df2 = df.groupby(['year', 'month']).sum().reset_index()
# STEP 2: filter for max by year
res = df2[df2['AMT'] == df2.groupby(['year'])['AMT'].transform('max')]
print(res)
year month AMT
1 2003 2 300
2 2004 1 400
Upvotes: 1