user636322
user636322

Reputation: 1201

Select row by max value in group in a pandas dataframe

I have a dataframe which was created via a df.pivot:

type                             start  end
F_Type         to_date                     
A              20150908143000    345    316
B              20150908140300    NaN    480
               20150908140600    NaN    120
               20150908143000  10743   8803
C              20150908140100    NaN   1715
               20150908140200    NaN   1062
               20150908141000    NaN    145
               20150908141500    418    NaN
               20150908141800    NaN    450
               20150908142900   1973   1499
               20150908143000  19522  16659
D              20150908143000    433     65
E              20150908143000   7290   7375
F              20150908143000      0      0
G              20150908143000   1796    340

I would like to filter and return a single row for each 'F_TYPE' only returning the row with the maximum 'to_date'. I would like to return the following dataframe:

type                             start  end
F_Type         to_date                     
A              20150908143000    345    316
B              20150908143000  10743   8803
C              20150908143000  19522  16659
D              20150908143000    433     65
E              20150908143000   7290   7375
F              20150908143000      0      0
G              20150908143000   1796    340

Upvotes: 43

Views: 70964

Answers (3)

cottontail
cottontail

Reputation: 23011

A much more concise solution is to sort the index and call groupby.tail.

new_df = df.sort_index().groupby(level='F_Type').tail(1)

Note that in the OP, F_Type and to_date are indices. If they were columns, use sort_values instead and call groupby.tail:

new_df = df.sort_values(['F_Type', 'to_date']).groupby('F_Type').tail(1)

result

Upvotes: 6

Sin-seok Seo
Sin-seok Seo

Reputation: 292

The other ways to do that are as follow:

  1. If you want only one max row per group.
(
    df
    .groupby(level=0)
    .apply(lambda group: group.nlargest(1, columns='to_date'))
    .reset_index(level=-1, drop=True)
)
  1. If you want to get all rows that are equal to max per group.
(
    df
    .groupby(level=0)
    .apply(lambda group: group.loc[group['to_date'] == group['to_date'].max()])
    .reset_index(level=-1, drop=True)
)

Upvotes: 17

unutbu
unutbu

Reputation: 879073

A standard approach is to use groupby(keys)[column].idxmax(). However, to select the desired rows using idxmax you need idxmax to return unique index values. One way to obtain a unique index is to call reset_index.

Once you obtain the index values from groupby(keys)[column].idxmax() you can then select the entire row using df.loc:

In [20]: df.loc[df.reset_index().groupby(['F_Type'])['to_date'].idxmax()]
Out[20]: 
                       start    end
F_Type to_date                     
A      20150908143000    345    316
B      20150908143000  10743   8803
C      20150908143000  19522  16659
D      20150908143000    433     65
E      20150908143000   7290   7375
F      20150908143000      0      0
G      20150908143000   1796    340

Note: idxmax returns index labels, not necessarily ordinals. After using reset_index the index labels happen to also be ordinals, but since idxmax is returning labels (not ordinals) it is better to always use idxmax in conjunction with df.loc, not df.iloc (as I originally did in this post.)

Upvotes: 97

Related Questions