I have a table that looks similar to this: user_id date count 1 2020 5 2021 7 2 2017 1 3 2020 2 2019 1 2021 3 I'm trying to keep only the row for each user_id that has the greatest count so it should look like something like this: user_id date count 1 2021 7 2 2017 1 3 2021 3 I've tried using df.groupby(level=0).apply(max) but it removes the date column from the final table and I'm not sure how to modify that to keep all three original columns

Reputation: 31

Getting max row from multi-index table

I have a table that looks similar to this:

user_id	date	count
1	2020	5
	2021	7
2	2017	1
3	2020	2
	2019	1
	2021	3

I'm trying to keep only the row for each user_id that has the greatest count so it should look like something like this:

user_id	date	count
1	2021	7
2	2017	1
3	2021	3

I've tried using df.groupby(level=0).apply(max) but it removes the date column from the final table and I'm not sure how to modify that to keep all three original columns

Upvotes: 3

Answers (1)

SeaBean

Reputation: 23217

You can try to specify only column count after .groupby() and then use .apply() to generate the boolean series whether the current entry in a group is equal to max count in group. Then, use .loc to locate the boolean series and display the whole dataframe.

df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]

Result:

         date  count
user_id             
1        2021      7
2        2017      1
3        2021      3

Note that if there are multiple entries in one user_id that have the same greatest count, all these entries will be kept.

In case for such multiple entries with greatest count you want to keep only one entry per user_id, you can use the following logics instead:

df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')

Result:

         date  count
user_id             
1        2021      7
2        2017      1
3        2021      3

Note that we cannot simply use df.loc[df.groupby(level=0)["count"].idxmax()] because user_id is the row index. This code only gives you all unfiltered rows just like the original dataframe unprocessed. This is because the index that idxmax() returns in this code is the user_id itself (instead of simple RangeIndex 0, 1, 2, ...etc). Then, when .loc locates these user_id index, it will simply return all entries under the same user_id.

Demo

Let's add more entries to the sample data and see the differences between the 2 solutions:

Our base df (user_id is the row index):

         date  count
user_id             
1        2018      7                 <=== max1
1        2020      5
1        2021      7                 <=== max2
2        2017      1
3        2020      3                 <=== max1
3        2019      1
3        2021      3                 <=== max2

1st Solution result:

df.loc[df.groupby(level=0)['count'].apply(lambda x: x == x.max())]


         date  count
user_id             
1        2018      7
1        2021      7
2        2017      1
3        2020      3
3        2021      3

2nd Solution result:

df1 = df.reset_index()
df1.loc[df1.groupby('user_id')['count'].idxmax()].set_index('user_id')


         date  count
user_id             
1        2018      7
2        2017      1
3        2020      3

Upvotes: 1

Getting max row from multi-index table

Answers (1)

Demo

Related Questions