Reputation: 361
Running this code:
df = pd.DataFrame(['ADc','Abc','AEc'],columns = ['Test'],index=[0,1,2])
df.sort(columns=['Test'],axis=0, ascending=False,inplace=True)
Returns a dataframe column ordered as: [Abc, AEc, ADc]
.
ADc should be before AEc, what's going on?
Upvotes: 11
Views: 5337
Reputation: 42916
Using DataFrame.sort_values
with key
argument since pandas >= 1.1.0
:
We can now pass a custom function of the string or any other custom key in the sort_values method:
df = pd.DataFrame(['ADc','Abc','AEc'],columns = ['Test'],index=[0,1,2])
print(df)
Test
0 ADc
1 Abc
2 AEc
df.sort_values(by="Test", key=lambda x: x.str.lower())
Test
1 Abc
0 ADc
2 AEc
Upvotes: 4
Reputation: 3297
I don't think that's a pandas bug. It seems to be just the way python sorting algorithm works with mixed cased letters (being case sensitive) - look here
Because when you do:
In [1]: l1 = ['ADc','Abc','AEc']
In [2]: l1.sort(reverse=True)
In [3]: l1
Out[3]: ['Abc', 'AEc', 'ADc']
So, since apparently one cannot control the sorting algorithm using the pandas sort method, just use a lower cased version of that column for the sorting and drop it later on:
In [4]: df = pd.DataFrame(['ADc','Abc','AEc'], columns=['Test'], index=[0,1,2])
In [5]: df['test'] = df['Test'].str.lower()
In [6]: df.sort(columns=['test'], axis=0, ascending=True, inplace=True)
In [7]: df.drop('test', axis=1, inplace=True)
In [8]: df
Out[8]:
Test
1 Abc
0 ADc
2 AEc
Note: If you want the column sorted alphabetically, the ascending
argument must be set to True
EDIT:
As DSM suggested, to avoid creating a new helper column, you can do:
df = df.loc[df["Test"].str.lower().order().index]
UPDATE:
As pointed out by weatherfrog, for newer versions of pandas the correct method is .sort_values()
. So the above one-liner becomes:
df = df.loc[df["Test"].str.lower().sort_values().index]
Upvotes: 11
Reputation: 1459
Here is an example of how to sort multiple columns using reindex
, extended from @Zero's answer here. We want to sort the example dataframe first by the second column (SORT_INDEX1
), then the first (SORT_INDEX2
). This example sorts the secondary column (SORT_INDEX2
) using a case-insensitive sort, then the primary column (SORT_INDEX1
) using the default case-sensitive sort.
import pandas as pd
df = pd.DataFrame([['q', '1'],['a', '1'],['B', '1'],['C', '1'],
['q', '0'],['a', '0'],['B', '0'],['C', '0']])
SORT_INDEX1 = 1
SORT_INDEX2 = 0
# Cannot change sorting algorithm used internally by pandas.
df_default = df.sort_values(by=[SORT_INDEX1, SORT_INDEX2])
# Use tuple of (index, value to sort by) to get a list of sorted indices, obtained through unzipping.
df_new = df.reindex(list(zip(*sorted(zip(df.index, df[SORT_INDEX2]), key=lambda t: t[1].lower())))[0])
.sort_values(by=SORT_INDEX1)
print('Original dataframe:')
print(df)
print('Default case-sensitive sort:')
print(df_default)
print('Case-insensitive sort:')
print(df_new)
Output:
Original dataframe:
0 1
0 q 1
1 a 1
2 B 1
3 C 1
4 q 0
5 a 0
6 B 0
7 C 0
Default case-sensitive sort:
0 1
6 B 0
7 C 0
5 a 0
4 q 0
2 B 1
3 C 1
1 a 1
0 q 1
Case-insensitive sort:
0 1
5 a 0
6 B 0
7 C 0
4 q 0
1 a 1
2 B 1
3 C 1
0 q 1
EDIT: Apologies, the second sort does not work properly for larger datasets. The secondary column's order is not preserved. This method will work fine for sorting by one column, but I still need to find a reliable and concise way to sort 2 columns.
Upvotes: 0