Reputation: 5060

How to sort pandas DataFrame with a key?

I'm looking for a way to sort pandas DataFrame. pd.DataFrame.sort_values doesn't accept a key function, pre-1.0. I can convert it to list and apply a key to sorted function, but that will be slow. The other way seems something related to categorical index. I don't have a fixed number of rows so I don't know if categorical index will be applicable.

I have given an example case of what kind of data I want to sort:

Input DataFrame:

     clouds  fluff
0    {[}      1
1    >>>      2
2     {1      3
3    123      4
4  AAsda      5
5    aad      6

Output DataFrame:

     clouds  fluff
0    >>>      2
1    {[}      1
2     {1      3
3    123      4
4    aad      6
5  AAsda      5

The rule for sorting (priority):

First special characters (sort by ascii among themselves)

Next is by numbers

next is by lower case alphabets (lexicographical)

next is Capital case alphabets (lexicographical)

In plain python I'd do it like

from functools import cmp_to_key

def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1

Case

sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))

Answer:

['DD', 'AA', 'aa', 'dd']

How would you do it with Pandas?

Upvotes: 11

Answers (4)

wuiover

Reputation: 81

As of pandas 1.2.0, I did this

import numpy as np
import pandas as pd

df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])

# This is the sorting rule
rule = {
    "DD": 1,
    "AA": 10,
    "aa": 20,
    "dd": 30,
    }


def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))


new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df)  # DD, AA, aa, dd

Of course, you can do this too, but it may be difficult to understand,smile

new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df)  # DD, AA, aa, dd

Upvotes: 8

Vasantha Ganesh

Reputation: 5060

As of pandas 1.1.0, pandas.DataFrame.sort_values accepts an argument key with type callable.

So in this case we would use:

df.sort_values(by='clouds', key=kf)

where kf is the key function that operates on type Series. Accepts and returns Series.

Upvotes: 8

Peter

Reputation: 13485

This seems to work:

def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """ Sort a dataframe from a column using the key """
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)

It passes tests:

def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
           DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
           DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
           DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
           DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))

Upvotes: 2

Khalil Al Hooti

Reputation: 4506

This might be useful, yet still not sure about special characters! can they actally be sorted!!

import pandas as pd

a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']

df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance,args = [int])

df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']), 
           df[df['lower'] == True].sort_values(by=['a']),
           df[df['upper'] == True].sort_values(by=['a'])])

print(df2)

   a    upper   lower   int
3   1   NaN     NaN     True
0   2   NaN     NaN     True
6   3   NaN     NaN     True
4   a   False   True    False
5   b   False   True    False
2   c   False   True    False
8   A   True    False   False
1   B   True    False   False
7   C   True    False   False

you can also do it in one step with creating new True False columns!

a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance,args = [int])].sort_values(by=['a']), 
           df[df['a'].str.islower() == True].sort_values(by=['a']),
           df[df['a'].str.isupper() == True].sort_values(by=['a'])])

    a
3   1
0   2
6   3
4   a
5   b
2   c
8   A
1   B
7   C

Upvotes: 2

How to sort pandas DataFrame with a key?

Answers (4)

Related Questions