slaw
slaw

Reputation: 6869

Python Pandas Groupby Behavior

I am trying to understand/use the groupby in Pandas to obtain the maximum count within each group and then return the row with the max count

df = pd.DataFrame({'name':['alpha','beta','gamma','delta','epsilon'], 'year':[2000,2001,2000,2001,2000], 'count':[1,25,11,20,8]})
df


    count   name    year
0   1       alpha   2000
1   25      beta    2001
2   11      gamma   2000
3   20      delta   2001
4   8       epsilon 2000

df.groupby('year').max()

        count   name
year        
2000    11      gamma
2001    25      delta

I am grouping by year but I'd like the output to look like:

        count   name
year        
2000    11      gamma
2001    25      beta

So, I'd like to return the entire row from each group (by 'year') based on the maximum count. It also isn't clear to me why the year 2001 returns "delta" in the name rather than "beta" (as I would expect).

Upvotes: 1

Views: 354

Answers (2)

unutbu
unutbu

Reputation: 879291

You can use idxmax to collect the index of the maximum row, and then use df.iloc to select the row:

In [73]: df.iloc[df.groupby('year')['count'].idxmax()].set_index('year')
Out[73]: 
      count   name
year              
2000     11  gamma
2001     25   beta

Note that df.groupby('year').max() is finding the max for each column -- both the count and the name. That is why delta was getting selected as a max in the name column.

If you are only interested in the max count, then use

df.groupby('year')['count'].max()

df.groupby('year') is a DataFrameGroupBy object. df.groupby('year')['count'] is a SeriesGroupBy object. The call to max or idxmax on the SeriesGroupBy object uses values only from the count column. This method of restricting the GroupBy object is documented here.

Upvotes: 1

bwb
bwb

Reputation: 1

You can use the general groupby().apply and return the rows whose count is equal to the group's maximum count. As in the other answer, the set_index is used to get rid of the extra year column.

>>> df = pd.DataFrame({'name':['alpha','beta','gamma','delta','epsilon'], 'year':[2000,2001,2000,2001,2000], 'count':[1,25,11,20,8]})

>>> df.groupby("year").apply(lambda g: g[g["count"] == g["count"].max()]).set_index("year")
      count   name
year              
2000     11  gamma
2001     25   beta

Upvotes: 0

Related Questions