Reputation: 495
I have a pandas dataframe as below:
df = pd.DataFrame({
'a': [1, 1, 1, 1, 1, 2, 2, 2, 2],
'b': [3, 2, 1, 4, 2, 1, 2, 2, 1]
})
Which gives me
>>> df
a b
0 1 3
1 1 2
2 1 1
3 1 4
4 1 2
5 2 1
6 2 2
7 2 2
8 2 1
I want to group the dataframe by column a
, and return first N maximum values from column b
of each group, ordered descending by that maximum value.
Let's consider I only want first two maximum values, I've done the following:
df = df.sort_values(['a', 'b'], ascending=False)
df = df.groupby('a').nth([0,1])
df = df.sort_values(['a', 'b'], ascending=False).reset_index()
Which gives me
>>> df
a b
0 2 2
1 2 2
2 1 4
3 1 3
The output Works fine, but it's not the efficient way to do this.
I have also tried nlargest
function, but it's not helping, because it drops other columns and returns only the b
column.
df = df.groupby('a')['b'].nlargest(2)
Which gives me
>>> df
0 2
1 2
2 4
3 3
Name: b, dtype: int64
What is the best way to do this?
Upvotes: 2
Views: 823
Reputation: 71689
Sort
the values by column b
then group
the dataframe and aggregate using head(n)
to select the first n
rows of each group
df.sort_values('b', ascending=False).groupby('a').head(2)
a b
3 1 4
0 1 3
6 2 2
7 2 2
Upvotes: 3