Reputation: 1201
I have a dataframe which was created via a df.pivot
:
type start end
F_Type to_date
A 20150908143000 345 316
B 20150908140300 NaN 480
20150908140600 NaN 120
20150908143000 10743 8803
C 20150908140100 NaN 1715
20150908140200 NaN 1062
20150908141000 NaN 145
20150908141500 418 NaN
20150908141800 NaN 450
20150908142900 1973 1499
20150908143000 19522 16659
D 20150908143000 433 65
E 20150908143000 7290 7375
F 20150908143000 0 0
G 20150908143000 1796 340
I would like to filter and return a single row for each 'F_TYPE' only returning the row with the maximum 'to_date'. I would like to return the following dataframe:
type start end
F_Type to_date
A 20150908143000 345 316
B 20150908143000 10743 8803
C 20150908143000 19522 16659
D 20150908143000 433 65
E 20150908143000 7290 7375
F 20150908143000 0 0
G 20150908143000 1796 340
Upvotes: 43
Views: 70964
Reputation: 23011
A much more concise solution is to sort the index and call groupby.tail
.
new_df = df.sort_index().groupby(level='F_Type').tail(1)
Note that in the OP, F_Type
and to_date
are indices. If they were columns, use sort_values
instead and call groupby.tail
:
new_df = df.sort_values(['F_Type', 'to_date']).groupby('F_Type').tail(1)
Upvotes: 6
Reputation: 292
The other ways to do that are as follow:
(
df
.groupby(level=0)
.apply(lambda group: group.nlargest(1, columns='to_date'))
.reset_index(level=-1, drop=True)
)
(
df
.groupby(level=0)
.apply(lambda group: group.loc[group['to_date'] == group['to_date'].max()])
.reset_index(level=-1, drop=True)
)
Upvotes: 17
Reputation: 879073
A standard approach is to use groupby(keys)[column].idxmax()
.
However, to select the desired rows using idxmax
you need idxmax
to return unique index values. One way to obtain a unique index is to call reset_index
.
Once you obtain the index values from groupby(keys)[column].idxmax()
you can then select the entire row using df.loc
:
In [20]: df.loc[df.reset_index().groupby(['F_Type'])['to_date'].idxmax()]
Out[20]:
start end
F_Type to_date
A 20150908143000 345 316
B 20150908143000 10743 8803
C 20150908143000 19522 16659
D 20150908143000 433 65
E 20150908143000 7290 7375
F 20150908143000 0 0
G 20150908143000 1796 340
Note: idxmax
returns index labels, not necessarily ordinals. After using reset_index
the index labels happen to also be ordinals, but since idxmax
is returning labels (not ordinals) it is better to always use idxmax
in conjunction with df.loc
, not df.iloc
(as I originally did in this post.)
Upvotes: 97