Pierre Avérous
Pierre Avérous

Reputation: 103

Order DataFrame by row count of col1, and category

I want to order this DataFrame by a given column field and the number of entries I have for this given field.

So let's say I have a very simple dataframe, looking something like this:

      name  age
0     Paul   12
1     Ryan   17
2  Michael  100
3     Paul   36
4     Paul   66
5  Michael   45

What I want as a result is something like

      name  age
0     Paul   12
1     Paul   36
2     Paul   66
3  Michael  100
4  Michael   45
5     Ryan   17

So I have 3 Paul's, so they come up first, then 2 Michael's, and finally only 1 Ryan.

Upvotes: 0

Views: 66

Answers (3)

Gustavo Gradvohl
Gustavo Gradvohl

Reputation: 712

The only change I added was the ability to sort by count of name, and by age.

    df['name_count'] = df['name'].map(df['name'].value_counts())
    df = df.sort_values(by=['name_count', 'age'], 
                        ascending=[False,True]).drop('name_count', axis=1)
    df.reset_index(drop=True)


        name    age
      0 Paul    12
      1 Paul    36
      2 Paul    66
      3 Michael 45
      4 Michael 100
      5 Ryan    17

Upvotes: 0

ALollz
ALollz

Reputation: 59529

Need to create a helper column to sort, in this case the size of the name groups. Add a .reset_index(drop=True) if you prefer a brand new RangeIndex, or keep as is if the original Index is useful.

Sorting does not change the ordering within equal values, so the first 'Paul' row will always appear first within 'Paul'

(df.assign(s = df.groupby('name').name.transform('size'))
   .sort_values('s', ascending=False)
   .drop(columns='s'))

Output

      name  age
0     Paul   12
3     Paul   36
4     Paul   66
2  Michael  100
5  Michael   45
1     Ryan   17

To allay fears raised in comments, this method is performant. Much more so than the above method. Plus you don't ruin your initial index.

import numpy as np
np.random.seed(42)
N = 10**6
df = pd.DataFrame({'name': np.random.randint(1, 10000, N),
                   'age': np.random.normal(0, 1, N)})

%%timeit 
(df.assign(s = df.groupby('name').name.transform('size'))
   .sort_values('s', ascending=False)
   .drop(columns='s'))
#500 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit 
x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()
#2.67 s ± 166 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 3

Brendan
Brendan

Reputation: 4011

One option: use value_counts to get the most frequent names, then set, sort, and reset the index:

x = list(df['name'].value_counts().index)
df.set_index('name').loc[x].reset_index()

returns

      name  age
0     Paul   12
1     Paul   36
2     Paul   66
3  Michael  100
4  Michael   45
5     Ryan   17

Upvotes: 4

Related Questions