Reputation: 31
I have a table that looks similar to this:
user_id | date | count |
---|---|---|
1 | 2020 | 5 |
2021 | 7 | |
2 | 2017 | 1 |
3 | 2020 | 2 |
2019 | 1 | |
2021 | 3 |
I'm trying to keep only the row for each user_id that has the greatest count so it should look like something like this:
user_id | date | count |
---|---|---|
1 | 2021 | 7 |
2 | 2017 | 1 |
3 | 2021 | 3 |
I've tried using df.groupby(level=0).apply(max) but it removes the date column from the final table and I'm not sure how to modify that to keep all three original columns
Upvotes: 3
Views: 328
Reputation: 23217
You can try to specify only column count
after .groupby()
and then use .apply()
to generate the boolean series whether the current entry in a group is equal to max count
in group. Then, use .loc
to locate the boolean series and display the whole dataframe.
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that if there are multiple entries in one user_id
that have the same greatest count, all these entries will be kept.
In case for such multiple entries with greatest count you want to keep only one entry per user_id
, you can use the following logics instead:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
Result:
date count
user_id
1 2021 7
2 2017 1
3 2021 3
Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()]
because user_id
is the row index. This code only gives you all unfiltered rows just like the original dataframe unprocessed. This is because the index that idxmax()
returns in this code is the user_id
itself (instead of simple RangeIndex 0, 1, 2, ...etc). Then, when .loc
locates these user_id
index, it will simply return all entries under the same user_id
.
Let's add more entries to the sample data and see the differences between the 2 solutions:
Our base df
(user_id
is the row index):
date count
user_id
1 2018 7 <=== max1
1 2020 5
1 2021 7 <=== max2
2 2017 1
3 2020 3 <=== max1
3 2019 1
3 2021 3 <=== max2
1st Solution result:
df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]
date count
user_id
1 2018 7
1 2021 7
2 2017 1
3 2020 3
3 2021 3
2nd Solution result:
df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')
date count
user_id
1 2018 7
2 2017 1
3 2020 3
Upvotes: 1