user12850579
user12850579

Reputation: 23

Sorting a pandas dataframe based on number of values of a categorical column

The sample dataset looks like this

col1 col2 col3
A 1 as
A 2 sd
B 3 df
C 5 fg
D 6 gh
A 1 hj
B 3 jk
B 4 kt
A 1 re
C 5 we
D 6 qw
D 7 aa

I want to sort the column col1 based on the number of occurences each item has, e.g. A has 4 occurences, B and D have 3 and C has 2 occurences. The dataframe should be sorted like A,A,A,A,B,B,B,D,D,D,C,C so that

Is there a way to achieve the same? Can I use sort_values to get desired result?

Upvotes: 2

Views: 1158

Answers (2)

PieCot
PieCot

Reputation: 3639

You can use sort_values, but you have to provide a callable to the key. From documentation:

Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.

In your case, the key function has to count the number of times each value appears in col1.

df.sort_values(by='col1', key=lambda x: [((df.col1 == y).sum(), -ord(y)) for y in x], ascending=False)

The tuple ((df.col1 == y).sum(), -ord(y)) is used to sort letters that have the same number of occurrences, using the integer representing their Unicode character.

If your dataframe is large, you should precompute these values, using values_counts and map:

df.sort_values(by='col1', key=lambda x: df.col1.map({k: (v, -ord(k)) for k,v in df.col1.value_counts().to_dict().items()}), ascending=False)

Here the result:

   col1  col2 col3
0     A     1   as
1     A     2   sd
5     A     1   hj
8     A     1   re
2     B     3   df
6     B     3   jk
7     B     4   kt
4     D     6   gh
10    D     6   qw
11    D     7   aa
3     C     5   fg
9     C     5   we

Upvotes: 1

jezrael
jezrael

Reputation: 862591

Create helper column by Series.map with Series.value_counts and use it for sorting with col1 by DataFrame.sort_values:

df['new'] = df['col1'].map(df['col1'].value_counts())
#alternative
#df['new'] = df.groupby('col1')['col1'].transform('count')

df1 = df.sort_values(['new','col1'], ascending=[False, True]).drop('new', axis=1)

One line solution:

df1 = (df.assign(new =df['col1'].map(df['col1'].value_counts()))
         .sort_values(['new','col1'], ascending=[False, True])
         .drop('new', axis=1))

print (df1)
   col1  col2 col3
0     A     1   as
1     A     2   sd
5     A     1   hj
8     A     1   re
2     B     3   df
6     B     3   jk
7     B     4   kt
4     D     6   gh
10    D     6   qw
11    D     7   aa
3     C     5   fg
9     C     5   we

Upvotes: 5

Related Questions