Reputation: 103
I want to order this DataFrame by a given column field and the number of entries I have for this given field.
So let's say I have a very simple dataframe, looking something like this:
name age
0 Paul 12
1 Ryan 17
2 Michael 100
3 Paul 36
4 Paul 66
5 Michael 45
What I want as a result is something like
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 100
4 Michael 45
5 Ryan 17
So I have 3 Paul's, so they come up first, then 2 Michael's, and finally only 1 Ryan.
Upvotes: 0
Views: 66
Reputation: 712
The only change I added was the ability to sort by count of name, and by age.
df['name_count'] = df['name'].map(df['name'].value_counts())
df = df.sort_values(by=['name_count', 'age'],
ascending=[False,True]).drop('name_count', axis=1)
df.reset_index(drop=True)
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 45
4 Michael 100
5 Ryan 17
Upvotes: 0
Reputation: 59529
Need to create a helper column to sort, in this case the size
of the name groups. Add a .reset_index(drop=True)
if you prefer a brand new RangeIndex, or keep as is if the original Index is useful.
Sorting does not change the ordering within equal values, so the first 'Paul'
row will always appear first within 'Paul'
(df.assign(s = df.groupby('name').name.transform('size'))
.sort_values('s', ascending=False)
.drop(columns='s'))
name age
0 Paul 12
3 Paul 36
4 Paul 66
2 Michael 100
5 Michael 45
1 Ryan 17
To allay fears raised in comments, this method is performant. Much more so than the above method. Plus you don't ruin your initial index.
import numpy as np
np.random.seed(42)
N = 10**6
df = pd.DataFrame({'name': np.random.randint(1, 10000, N),
'age': np.random.normal(0, 1, N)})
%%timeit
(df.assign(s = df.groupby('name').name.transform('size'))
.sort_values('s', ascending=False)
.drop(columns='s'))
#500 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
#2.67 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 3
Reputation: 4011
One option: use value_counts
to get the most frequent names, then set, sort, and reset the index:
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
returns
name age
0 Paul 12
1 Paul 36
2 Paul 66
3 Michael 100
4 Michael 45
5 Ryan 17
Upvotes: 4