Reputation: 1495
How does one return only the observations from a dataframe that hold the max date for each unique key? I tried groupby and max, but because the flag values differ I get more than one row back per key. See the example below:
import pandas as pd
key = [111,111,222,333,444,555]
flag = [0,1,1,1,1,1]
date = [pd.to_datetime('2020-01-01'),pd.to_datetime('2020-01-02'),
pd.to_datetime('2020-02-01'),pd.to_datetime('2020-04-01'),
pd.to_datetime('2020-03-01'),pd.to_datetime('2020-05-01')]
df_dic = {'key':key, 'flag':flag, 'date':date}
df = pd.DataFrame(df_dic)
df
df.groupby(['key', 'flag']).agg({'date':'max'}).reset_index()
This returns:
key flag date
0 111 0 2020-01-01
1 111 1 2020-01-02
2 222 1 2020-02-01
3 333 1 2020-04-01
4 444 1 2020-03-01
5 555 1 2020-05-01
I want to return only the observations with the max date per unique key, like so:
key flag date
0 111 1 2020-01-02
1 222 1 2020-02-01
2 333 1 2020-04-01
3 444 1 2020-03-01
4 555 1 2020-05-01
Upvotes: 1
Views: 232
Reputation: 29635
You can avoid groupby and instead use sort_values together with drop_duplicates:
print(df.sort_values('date').drop_duplicates('key', keep='last'))
key flag date
1 111 1 2020-01-02
2 222 1 2020-02-01
4 444 1 2020-03-01
3 333 1 2020-04-01
5 555 1 2020-05-01
Note that you can add the key column to the sort_values call if the order of the result matters to you, e.g. df.sort_values(['key', 'date'])..., as sketched below.
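For example, a quick sketch of that variant on the same df from the question (the rows now come back ordered by key rather than by date):
# sorting by key first keeps the result in key order
print(df.sort_values(['key', 'date']).drop_duplicates('key', keep='last'))
key flag date
1 111 1 2020-01-02
2 222 1 2020-02-01
3 333 1 2020-04-01
4 444 1 2020-03-01
5 555 1 2020-05-01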
Upvotes: 1
Reputation: 9257
One possible solution (although maybe not the best one) is to use an inner join after the groupby. This ensures that the flags are taken from the original dataframe, whether they are 0 or 1.
df.groupby(['key'])\
.agg({'date': 'max'})\
.reset_index()\
.merge(df, how='inner', on=['key', 'date'])
# key date flag
# 0 111 2020-01-02 1
# 1 222 2020-02-01 1
# 2 333 2020-04-01 1
# 3 444 2020-03-01 1
# 4 555 2020-05-01 1
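As a side note, a similar result can be reached without the merge, e.g. with idxmax (just a sketch, assuming the date column has no NaT values so every key has a well-defined max):
# pick, for each key, the row whose date is that key's maximum
df.loc[df.groupby('key')['date'].idxmax()]
# key flag date
# 1 111 1 2020-01-02
# 2 222 1 2020-02-01
# 3 333 1 2020-04-01
# 4 444 1 2020-03-01
# 5 555 1 2020-05-01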
Upvotes: 1