Reputation: 23
The sample dataset looks like this
col1 | col2 | col3 |
---|---|---|
A | 1 | as |
A | 2 | sd |
B | 3 | df |
C | 5 | fg |
D | 6 | gh |
A | 1 | hj |
B | 3 | jk |
B | 4 | kt |
A | 1 | re |
C | 5 | we |
D | 6 | qw |
D | 7 | aa |
I want to sort the column col1 based on the number of occurences each item has, e.g. A has 4 occurences, B and D have 3 and C has 2 occurences. The dataframe should be sorted like A,A,A,A,B,B,B,D,D,D,C,C so that
Is there a way to achieve the same? Can I use sort_values to get desired result?
Upvotes: 2
Views: 1158
Reputation: 3639
You can use sort_values
, but you have to provide a callable to the key
. From documentation:
Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
In your case, the key function has to count the number of times each value appears in col1
.
df.sort_values(by='col1', key=lambda x: [((df.col1 == y).sum(), -ord(y)) for y in x], ascending=False)
The tuple ((df.col1 == y).sum(), -ord(y))
is used to sort letters that have the same number of occurrences, using the integer representing their Unicode character.
If your dataframe is large, you should precompute these values, using values_counts
and map
:
df.sort_values(by='col1', key=lambda x: df.col1.map({k: (v, -ord(k)) for k,v in df.col1.value_counts().to_dict().items()}), ascending=False)
Here the result:
col1 col2 col3
0 A 1 as
1 A 2 sd
5 A 1 hj
8 A 1 re
2 B 3 df
6 B 3 jk
7 B 4 kt
4 D 6 gh
10 D 6 qw
11 D 7 aa
3 C 5 fg
9 C 5 we
Upvotes: 1
Reputation: 862591
Create helper column by Series.map
with Series.value_counts
and use it for sorting with col1
by DataFrame.sort_values
:
df['new'] = df['col1'].map(df['col1'].value_counts())
#alternative
#df['new'] = df.groupby('col1')['col1'].transform('count')
df1 = df.sort_values(['new','col1'], ascending=[False, True]).drop('new', axis=1)
One line solution:
df1 = (df.assign(new =df['col1'].map(df['col1'].value_counts()))
.sort_values(['new','col1'], ascending=[False, True])
.drop('new', axis=1))
print (df1)
col1 col2 col3
0 A 1 as
1 A 2 sd
5 A 1 hj
8 A 1 re
2 B 3 df
6 B 3 jk
7 B 4 kt
4 D 6 gh
10 D 6 qw
11 D 7 aa
3 C 5 fg
9 C 5 we
Upvotes: 5