Souvik Mondal

Reputation: 63

How to get the max out of a group by on two columns and sum on third in a pandas dataframe?

I ran a groupby on a pandas DataFrame like this:

df.groupby(['year','month'])['AMT'].agg('sum')

And I get something like this

year  month
2003  1            114.00
      2           9195.00
      3            300.00
      5            200.00
      6            450.00
      7             68.00
      8            750.00
      9           3521.00
      10           250.00
      11           799.00
      12          1000.00
2004  1           8551.00
      2           9998.00
      3          17334.00
      4           2525.00
      5          16014.00
      6           9132.00
      7          10623.00
      8           7538.00
      9           3650.00
      10          7733.00
      11         10128.00
      12          4741.00
2005  1           6965.00
      2           3208.00
      3           8630.00
      4           7776.00
      5          11950.00
      6          11717.00
      7           1510.00
              ...    
2015  7        1431441.00
      8         966974.00
      9        1121650.00
      10       1200104.00
      11       1312191.90
      12        482535.00
2016  1        1337343.00
      2        1465068.00
      3        1170113.00
      4        1121691.00
      5        1302936.00
      6        1518047.00
      7        1251844.00
      8         825215.00
      9        1491626.00
      10       1243877.00
      11       1632252.00
      12        750995.50
2017  1         905974.00
      2        1330182.00
      3        1382628.52
      4        1146789.00
      5        1201425.00
      6        1278701.00
      7        1172596.00
      8        1517116.50
      9        1108609.00
      10       1360841.00
      11       1340386.00
      12        860686.00

What I want is to keep only the month with the largest summed AMT in each year, so that the final result has one row per year, something like:

year  month
2003      2           9195.00
2004      3          17334.00
2005      5          11950.00

... and so on

What do I have to add to my group by aggregation to do this?

Upvotes: 5

Views: 3540

Answers (2)

jezrael

Reputation: 862731

I think you need idxmax on the grouped sums (SeriesGroupBy.idxmax here, since s is a Series):

s = df.groupby(['year','month'])['AMT'].sum()
out = s.loc[s.groupby(level=0).idxmax()]
# in newer pandas versions you can also group by the index level name
#out = s.loc[s.groupby('year').idxmax()]
print(out)
year  month
2003  2         9195.0
2004  3        17334.0
2005  5        11950.0
Name: AMT, dtype: float64
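
For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same idea. The sample values are made up for illustration (only the column names come from the question), and the `.tolist()` call is my own defensive choice so that `.loc` receives a plain list of (year, month) tuples.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the question.
df = pd.DataFrame({
    'year':  [2003, 2003, 2003, 2004, 2004, 2004],
    'month': [1, 2, 2, 1, 3, 3],
    'AMT':   [114.0, 9000.0, 195.0, 8551.0, 17000.0, 334.0],
})

# Sum AMT per (year, month); the result is a Series with a MultiIndex.
s = df.groupby(['year', 'month'])['AMT'].sum()

# Per year, idxmax returns the (year, month) label of the largest sum.
best = s.groupby(level='year').idxmax()

# Select exactly those labels from the summed Series.
out = s.loc[best.tolist()]
print(out)
```

If a DataFrame is preferred over a Series, `out.reset_index()` turns year and month back into ordinary columns.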

If a year can have several months tied for the maximum and you want to keep all of them:

out = s[s == s.groupby(level=0).transform('max')]
print(out)
year  month
2003  2         9195.0
2004  3        17334.0
2005  5        11950.0
Name: AMT, dtype: float64
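
The two outputs above are identical because the data has no ties. A small sketch with a deliberate tie shows where the two approaches differ; the Series below is hypothetical and built by hand to stand in for the summed result.

```python
import pandas as pd

# Hypothetical monthly sums with a tie in 2003 (months 2 and 5).
s = pd.Series(
    [114.0, 9195.0, 9195.0, 8551.0, 17334.0],
    index=pd.MultiIndex.from_tuples(
        [(2003, 1), (2003, 2), (2003, 5), (2004, 1), (2004, 3)],
        names=['year', 'month'],
    ),
    name='AMT',
)

# idxmax keeps only the first maximum found in each year.
first_only = s.loc[s.groupby(level='year').idxmax().tolist()]

# Comparing against the per-year maximum keeps every tied row.
all_ties = s[s == s.groupby(level='year').transform('max')]

print(first_only)
print(all_ties)
```

Here `first_only` has one row per year, while `all_ties` keeps both tied months for 2003.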

Upvotes: 5

jpp

Reputation: 164693

You can use GroupBy + transform with max. Note this gives multiple maximums for any years where a tie exists. This may or may not be what you require.

As you have requested, it's possible to do this in 2 steps, first summing and then calculating maximums by year.

import pandas as pd

df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
                   'month': [1, 2, 2, 1, 1, 2],
                   'AMT': [100, 200, 100, 100, 300, 100]})

# STEP 1: sum by year + month
df2 = df.groupby(['year', 'month']).sum().reset_index()

# STEP 2: filter for max by year
res = df2[df2['AMT'] == df2.groupby(['year'])['AMT'].transform('max')]

print(res)

   year  month  AMT
1  2003      2  300
2  2004      1  400
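
If exactly one row per year is required even when ties exist, a possible variant of the two-step approach is to pick rows by idxmax on the flat frame. This is only a sketch on the same sample data; `as_index=False` is used here in place of `reset_index()`.

```python
import pandas as pd

df = pd.DataFrame({'year': [2003, 2003, 2003, 2004, 2004, 2004],
                   'month': [1, 2, 2, 1, 1, 2],
                   'AMT': [100, 200, 100, 100, 300, 100]})

# STEP 1: sum by year + month; as_index=False keeps them as ordinary columns.
df2 = df.groupby(['year', 'month'], as_index=False)['AMT'].sum()

# STEP 2: idxmax gives the row label of the largest AMT in each year
# (the first such row if there is a tie), and .loc selects those rows.
res = df2.loc[df2.groupby('year')['AMT'].idxmax()]

print(res)
```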

Upvotes: 1
