Reputation: 12826
Suppose I have a dataframe with columns a, b and c. I want to sort the dataframe by column b in ascending order, and by column c in descending order. How do I do this?
Upvotes: 527
Views: 633044
Reputation: 23111
sort_values has a stable sorting option, which can be invoked by passing kind='stable'. Note that the sort-by columns must be applied in reverse order for the stable sort to work correctly: sort by the secondary column first and by the primary column last.
So the following two methods produce the same output, i.e. df1 and df2 contain the same rows in the same order.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(100,2)), columns=['a', 'b'])
df1 = df.sort_values(['a', 'b'], ascending=[True, False]) # sort by 'a' then 'b'
df2 = (
    df
    .sort_values('b', ascending=False) # sort by 'b' first
    .sort_values('a', ascending=True, kind='stable') # then by 'a'
)
# compare positionally; df1.eq(df2) aligns on index labels first, which hides row order
assert (df1.to_numpy() == df2.to_numpy()).all()
This is especially useful if you need a more involved sorting key.
Say, given df below, you want to sort by 'date' and 'value' but treat 'date' like datetime values even though they are strings. A straightforward sort_values with two sort-by columns would compare the dates as strings and produce the wrong result; however, calling sort_values twice, with the relevant sorting key on the 'date' pass, produces the correct output.
df = pd.DataFrame({'date': ['10/1/2024', '10/1/2024', '2/23/2024'], 'value': [0, 1, 0]})
df1 = df.sort_values(['date', 'value'], ascending=[True, False]) # <--- wrong output
df2 = (
    df
    .sort_values('value', ascending=False)
    .sort_values('date', ascending=True, kind='stable', key=pd.to_datetime)
) # <--- correct output
N.B. We can get the same output by assigning a new datetime column and using it as a sort-by column, but IMO the stable sort with the sorting key is much cleaner.
df3 = df.assign(dummy=pd.to_datetime(df['date'])).sort_values(['dummy', 'value'], ascending=[True, False]).drop(columns='dummy')
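To make the difference concrete, here is a minimal sketch that runs the three approaches on the same three-row frame and records the resulting row order. The helper name parse_dates and the explicit format='%m/%d/%Y' are assumptions added for illustration; the code above relies on pd.to_datetime inferring the format.

```python
import pandas as pd

df = pd.DataFrame({'date': ['10/1/2024', '10/1/2024', '2/23/2024'], 'value': [0, 1, 0]})

def parse_dates(s):
    # Hypothetical helper: parse month/day/year strings with an explicit format.
    return pd.to_datetime(s, format='%m/%d/%Y')

# Plain string comparison puts '10/1/2024' before '2/23/2024'.
by_string = df.sort_values(['date', 'value'], ascending=[True, False])

# Two passes: secondary key first, then a stable sort on the primary key.
two_pass = (
    df
    .sort_values('value', ascending=False, kind='stable')
    .sort_values('date', ascending=True, kind='stable', key=parse_dates)
)

# Single pass on a temporary datetime column.
with_column = (
    df
    .assign(dummy=parse_dates(df['date']))
    .sort_values(['dummy', 'value'], ascending=[True, False])
    .drop(columns='dummy')
)

string_order = by_string.index.tolist()
two_pass_order = two_pass.index.tolist()
column_order = with_column.index.tolist()
print(string_order, two_pass_order, column_order)
```

The two datetime-aware variants should agree with each other and differ from the string-based sort, which places the February row last.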
Upvotes: 6
Reputation: 2137
For those who come here looking to sort a DataFrame with multi-level columns, use a tuple with one element per column level:
d = {}
d['first_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
                                data=[[10, 0.89, 0.98, 0.31],
                                      [20, 0.34, 0.78, 0.34]]).set_index('idx')
d['second_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
                                 data=[[10, 0.29, 0.63, 0.99],
                                       [20, 0.23, 0.26, 0.98]]).set_index('idx')
df = pd.concat(d, axis=1)
df.sort_values(('second_level', 'b'))
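The same idea should extend to several sort keys with mixed directions: pass a list of tuples together with a matching ascending list. The sketch below is an illustration under assumptions, not part of the original answer; the row with idx 30 and its values are made up so that the second key has a tie to break.

```python
import pandas as pd

idx = pd.Index([10, 20, 30], name='idx')
# Row 30 is invented for this example so that ('second_level', 'b') has a tie.
first = pd.DataFrame({'a': [0.89, 0.34, 0.50], 'b': [0.98, 0.78, 0.11]}, index=idx)
second = pd.DataFrame({'a': [0.29, 0.23, 0.77], 'b': [0.63, 0.26, 0.26]}, index=idx)
df = pd.concat({'first_level': first, 'second_level': second}, axis=1)

# Each sort key is a (level_0, level_1) tuple; directions are given per key.
result = df.sort_values(
    [('second_level', 'b'), ('first_level', 'a')],
    ascending=[True, False],
)
multi_key_order = result.index.tolist()
print(multi_key_order)
```

Rows 20 and 30 tie on ('second_level', 'b'), so the descending ('first_level', 'a') key decides their relative order.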
Upvotes: 0
Reputation: 164673
For large dataframes of numeric data, you may see a significant performance improvement via numpy.lexsort, which performs an indirect sort using a sequence of keys:
import pandas as pd
import numpy as np

np.random.seed(0)
df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])
df1 = pd.concat([df1]*100000)

def pdsort(df1):
    return df1.sort_values(['a', 'b'], ascending=[True, False])

def lex(df1):
    arr = df1.values
    return pd.DataFrame(arr[np.lexsort((-arr[:, 1], arr[:, 0]))])

assert (pdsort(df1).values == lex(df1).values).all()

%timeit pdsort(df1) # 193 ms per loop
%timeit lex(df1) # 143 ms per loop
One peculiarity is that the key order for numpy.lexsort is reversed: the last key is the primary one, so (-'b', 'a') sorts by series a first. We negate series b to reflect that we want this series in descending order.
Be aware that this negation trick only works with numeric values, while pd.DataFrame.sort_values works with either string or numeric values. Negating a string column for np.lexsort will give: TypeError: bad operand type for unary -: 'str'.
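The reversed key order is easier to see on a tiny frame. This is a small sketch with made-up data; unlike lex above, it uses df.iloc on the lexsort result so the original index labels and column names are kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 1, 2], 'b': [1, 3, 4, 2]})
arr = df.to_numpy()

# np.lexsort treats the LAST key as the primary one, so 'a' goes last.
# Negating 'b' turns its ascending sort into a descending one.
order = np.lexsort((-arr[:, 1], arr[:, 0]))

lex_sorted = df.iloc[order]
pd_sorted = df.sort_values(['a', 'b'], ascending=[True, False])
print(order.tolist())
```

Both lex_sorted and pd_sorted should list the a == 1 rows first, each group ordered by b from high to low.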
Upvotes: 17
Reputation: 375485
As of the 0.17.0 release, the sort method was deprecated in favor of sort_values. sort was completely removed in the 0.20.0 release. The arguments (and results) remain the same:
df.sort_values(['a', 'b'], ascending=[True, False])
In versions before 0.20.0, you can use the ascending argument of sort:
df.sort(['a', 'b'], ascending=[True, False])
For example:
In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])
In [12]: df1.sort(['a', 'b'], ascending=[True, False])
Out[12]:
   a  b
2  1  4
7  1  3
1  1  2
3  1  2
4  3  2
6  4  4
0  4  3
9  4  3
5  4  1
8  4  1
As commented by @renadeen:
Sort isn't in place by default! So you should assign result of the sort method to a variable or add inplace=True to method call.
That is, if you want to reuse df1 as a sorted DataFrame:
df1 = df1.sort(['a', 'b'], ascending=[True, False])
or
df1.sort(['a', 'b'], ascending=[True, False], inplace=True)
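Since sort no longer exists in current pandas, here is a minimal sketch of the same point using sort_values. The data is made up, and the variable names are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [4, 1, 3], 'b': [1, 2, 3]})

# Without assignment the sorted copy is discarded and df1 is unchanged.
df1.sort_values(['a', 'b'], ascending=[True, False])
unchanged_order = df1.index.tolist()

# Option 1: bind the sorted copy to a name (df1 itself works too).
reassigned = df1.sort_values(['a', 'b'], ascending=[True, False])

# Option 2: sort the existing object; the call itself returns None.
returned = df1.sort_values(['a', 'b'], ascending=[True, False], inplace=True)
inplace_order = df1.index.tolist()
print(unchanged_order, reassigned.index.tolist(), returned, inplace_order)
```

Note that the in-place call returns None, so writing df1 = df1.sort_values(..., inplace=True) would replace the DataFrame with None.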
Upvotes: 917
Reputation: 9768
As of pandas 0.17.0, DataFrame.sort() is deprecated and set to be removed in a future version of pandas. The way to sort a dataframe by its values is now DataFrame.sort_values.
As such, the answer to your question would now be:
df.sort_values(['b', 'c'], ascending=[True, False], inplace=True)
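As a quick check on the columns from the question, the following sketch sorts a four-row frame by b ascending and c descending. The values are invented for illustration, and the non-in-place form is used so the result can be inspected next to the original.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['w', 'x', 'y', 'z'],
    'b': [2, 1, 2, 1],
    'c': [10, 30, 20, 40],
})

# b ascending groups the rows; within each b, c runs from high to low.
result = df.sort_values(['b', 'c'], ascending=[True, False])
a_order = result['a'].tolist()
print(result)
```

Column a takes no part in the ordering here; it just rides along with its row.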
Upvotes: 92