python pandas python-2.7 sorting data-analysis

Reputation: 12826

How to sort a pandas dataFrame by two or more columns?

Suppose I have a dataframe with columns a, b and c. I want to sort the dataframe by column b in ascending order, and by column c in descending order. How do I do this?

Upvotes: 527

Answers (5)

cottontail

Reputation: 23111

sort_values has a stable sorting option, which can be invoked by passing kind='stable'. Note that to use stable sorting correctly we need to sort by the columns in reverse order of priority: the least important column first and the most important column last.

So the following two methods produce the same output, i.e. df1 and df2 are equivalent.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(100,2)), columns=['a', 'b'])

df1 = df.sort_values(['a', 'b'], ascending=[True, False])  # sort by 'a' then 'b'

df2 = (
    df
    .sort_values('b', ascending=False)                     # sort by 'b' first
    .sort_values('a', ascending=True, kind='stable')       # then by 'a'
)

assert df1.eq(df2).all().all()

This is especially useful if you need a bit more involved sorting key.

Say, given df below, you want to sort by 'date' and 'value' but treat 'date' like datetime values even though they are strings. A straightforward sort_values on the two columns compares the dates as strings and so produces the wrong order; however, calling sort_values twice, with the relevant sorting key on the 'date' pass, produces the correct output.

df = pd.DataFrame({'date': ['10/1/2024', '10/1/2024', '2/23/2024'], 'value': [0, 1, 0]})

df1 = df.sort_values(['date', 'value'], ascending=[True, False])  # <--- wrong output

df2 = (
    df
    .sort_values('value', ascending=False)
    .sort_values('date', ascending=True, kind='stable', key=pd.to_datetime) 
)  # <--- correct output

N.B. We can get the same output by assigning a new datetime column and using it as a sort-by column, but IMO the stable sort with the sorting key is much cleaner.

df3 = df.assign(dummy=pd.to_datetime(df['date'])).sort_values(['dummy', 'value'], ascending=[True, False]).drop(columns='dummy')
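The chained stable sorts above generalize to any number of columns, each with its own key. Below is a minimal sketch of that idea; the helper name sort_by_many and the shape of its specs argument are hypothetical (not part of pandas), and it assumes a pandas version that supports the key argument of sort_values.

```python
import pandas as pd

def sort_by_many(df, specs):
    """Hypothetical helper: specs is a list of (column, ascending, key)
    tuples in priority order; key may be None."""
    out = df
    # Apply the lowest-priority sort first. Each later stable sort keeps
    # the order established by the earlier ones among tied rows.
    for column, ascending, key in reversed(specs):
        out = out.sort_values(column, ascending=ascending, kind='stable', key=key)
    return out

df = pd.DataFrame({'name': ['b', 'A', 'a', 'B'], 'value': [1, 2, 3, 4]})

# Sort by name case-insensitively (ascending), then by value (descending).
result = sort_by_many(df, [
    ('name', True, lambda s: s.str.lower()),
    ('value', False, None),
])
print(result)
```

Only the name pass needs a key here; the value pass sorts on the raw integers.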

Upvotes: 6

Muhammad Yasirroni

Reputation: 2137

For those who come here with a DataFrame that has multi-level (MultiIndex) columns: identify each column with a tuple whose elements correspond to each level.

d = {}
d['first_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
                                         data=[[10, 0.89, 0.98, 0.31],
                                               [20, 0.34, 0.78, 0.34]]).set_index('idx')
d['second_level'] = pd.DataFrame(columns=['idx', 'a', 'b', 'c'],
                                          data=[[10, 0.29, 0.63, 0.99],
                                                [20, 0.23, 0.26, 0.98]]).set_index('idx')

df = pd.concat(d, axis=1)
df.sort_values(('second_level', 'b'))
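To sort by two such columns at once, as the question asks, it should be enough to pass a list of tuples together with a matching ascending list. A small sketch with made-up data (the frame name wide and its values are only for illustration):

```python
import pandas as pd

d = {
    'first_level': pd.DataFrame({'a': [1, 1, 0], 'b': [5, 6, 7]}, index=[10, 20, 30]),
    'second_level': pd.DataFrame({'a': [9, 8, 7], 'b': [0.1, 0.9, 0.5]}, index=[10, 20, 30]),
}
wide = pd.concat(d, axis=1)

# Each sort key is a full column label, i.e. one tuple per column.
result = wide.sort_values(
    [('first_level', 'a'), ('second_level', 'b')],
    ascending=[True, False],
)
print(result)
```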

Upvotes: 0

jpp

Reputation: 164673

For large dataframes of numeric data, you may see a significant performance improvement via numpy.lexsort, which performs an indirect sort using a sequence of keys:

import pandas as pd
import numpy as np

np.random.seed(0)

df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])
df1 = pd.concat([df1]*100000)

def pdsort(df1):
    return df1.sort_values(['a', 'b'], ascending=[True, False])

def lex(df1):
    arr = df1.values
    return pd.DataFrame(arr[np.lexsort((-arr[:, 1], arr[:, 0]))])

assert (pdsort(df1).values == lex(df1).values).all()

%timeit pdsort(df1)  # 193 ms per loop
%timeit lex(df1)     # 143 ms per loop

One peculiarity is that numpy.lexsort takes its keys in reverse priority order: the last key in the sequence is the primary one, so (-'b', 'a') sorts by series a first. We negate series b because we want this series in descending order.

Be aware that this negation trick only works with numeric values, while pd.DataFrame.sort_values works with either string or numeric values. Negating a string column to pass it to np.lexsort will give: TypeError: bad operand type for unary -: 'str'.
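If a descending key is not numeric, one possible workaround (my own assumption, not something the benchmark above covers) is to turn it into sorted integer codes with pd.factorize(..., sort=True) and negate those codes instead. The sketch also uses iloc on the original frame, so the column names and index are preserved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['x', 'y', 'z', 'w']})

# Integer codes that follow the sorted order of the strings in 'b'.
codes, _ = pd.factorize(df['b'], sort=True)

# The last key is the primary one: 'a' ascending, then 'b' descending.
order = np.lexsort((-codes, df['a'].to_numpy()))
result = df.iloc[order]

expected = df.sort_values(['a', 'b'], ascending=[True, False])
print(result)
```

Whether this is still faster than sort_values depends on the data, so it is worth timing on your own frame.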

Upvotes: 17

Andy Hayden

Reputation: 375485

As of the 0.17.0 release, the sort method was deprecated in favor of sort_values. sort was completely removed in the 0.20.0 release. The arguments (and results) remain the same:

df.sort_values(['a', 'b'], ascending=[True, False])

In versions before 0.17.0, you can use the ascending argument of sort:

df.sort(['a', 'b'], ascending=[True, False])

For example:

In [11]: df1 = pd.DataFrame(np.random.randint(1, 5, (10,2)), columns=['a','b'])

In [12]: df1.sort(['a', 'b'], ascending=[True, False])
Out[12]:
   a  b
2  1  4
7  1  3
1  1  2
3  1  2
4  3  2
6  4  4
0  4  3
9  4  3
5  4  1
8  4  1

As commented by @renadeen:

Sort isn't in place by default! So you should assign result of the sort method to a variable or add inplace=True to method call.

that is, if you want to reuse df1 as a sorted DataFrame:

df1 = df1.sort(['a', 'b'], ascending=[True, False])

or

df1.sort(['a', 'b'], ascending=[True, False], inplace=True)
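The same in-place behaviour holds for sort_values on current pandas versions. A short sketch with made-up data, showing that the plain call leaves the original untouched and that inplace=True returns None:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [4, 1, 1], 'b': [3, 2, 4]})

# Without inplace=True a new, sorted DataFrame is returned.
sorted_copy = df1.sort_values(['a', 'b'], ascending=[True, False])
original_order = df1.index.tolist()   # df1 itself is still unsorted here

# With inplace=True the frame is modified and the call returns None,
# so do not assign its result back to df1.
returned = df1.sort_values(['a', 'b'], ascending=[True, False], inplace=True)
print(df1)
```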

Upvotes: 917

Kyle Heuton

Reputation: 9768

As of pandas 0.17.0, DataFrame.sort() is deprecated and set to be removed in a future version of pandas. The way to sort a dataframe by its values is now DataFrame.sort_values.

As such, the answer to your question would now be

df.sort_values(['b', 'c'], ascending=[True, False], inplace=True)
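For a concrete picture, here is a small sketch on a frame with columns a, b and c as in the question (the values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['w', 'x', 'y', 'z'],
    'b': [2, 1, 2, 1],
    'c': [10, 20, 30, 40],
})

# 'b' ascending first; rows with equal 'b' are ordered by 'c' descending.
result = df.sort_values(['b', 'c'], ascending=[True, False])
print(result)
```

The two lists are matched by position, so ascending must have one entry per column in by.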

Upvotes: 92
