Reputation: 21

pandas dataframe sort_values on multiple columns did not work all the time

I'm having this 2 pandas dataframes, I tried to sort them the same way, but getting different result

import pandas as pd

df = pd.DataFrame([{'a':'4','b':'23'},{'a':'3', 'b':'10'},{'a':'2','b':'88'},{'a':'2', 'b':'29'},{'a':'2','b':'12'},{'a':'1','b':'284'}, {'a':'1','b':'24'}, {'a':'3','b':'1'}])

print(df.sort_values(['a', 'b']))

df1 = pd.DataFrame([{'c1':'3','c2':'10'},{'c1':'2', 'c2':'30'},{'c1':'1','c2':'20'},{'c1':'2','c2':'15'},{'c1':'2','c2':'100'},{'c1':'1','c2':'11'}])

print(df1.sort_values(['c1', 'c2']))

the first one returns, which is expected

the 2nd one returns

I would think it would return

can someone tell me why they behave differently? Thanks.

Upvotes: 2

Answers (2)

jalopezp

Reputation: 839

Sounds like you want to sort a dataframe with string columns. You want to sort the values as if they were ints but want to keep the datatype of the column as string for some reason.

The best way to keep the datatype intact but sort as integer is to use the key parameter in sort_values():

df1 = pd.DataFrame(
        [{'c1':'3', 'c2':'10'},
         {'c1':'2', 'c2':'30'},
         {'c1':'1', 'c2':'20'},
         {'c1':'2', 'c2':'15'},
         {'c1':'2', 'c2':'100'},
         {'c1':'1', 'c2':'11'}])


df1.sort_values(['c1', 'c2'], key=lambda s: s.astype(int))

The key parameter is applied to every column you're sorting by, so if you have an additional column that you also want to sort by but cannot convert to int, it will fail. The astype() function does let you handle that case:

df2 = pd.DataFrame(
        [{'c1':'3', 'c2':'10', 'c3':'a'},
         {'c1':'2', 'c2':'15', 'c3':'a'},
         {'c1':'1', 'c2':'20', 'c3':'b'},
         {'c1':'2', 'c2':'15', 'c3':'b'},
         {'c1':'2', 'c2':'100', 'c3':'a'},
         {'c1':'1', 'c2':'20', 'c3':'a'}])

(df2
    .sort_values(['c1', 'c2'], key=lambda s: s.astype(int, errors='ignore'))
)

More generally, if you have multiple columns with different conversions, then you have to call sort_values multiple times:

df3 = pd.DataFrame(
        [{'c1':'3', 'c2':'10', 'c3':'a', 'c4':'1/6/2023'},
         {'c1':'2', 'c2':'15', 'c3':'b', 'c4':'2/6/2023'},
         {'c1':'1', 'c2':'20', 'c3':'b', 'c4':'1/6/2023'},
         {'c1':'2', 'c2':'15', 'c3':'a', 'c4':'4/6/2023'},
         {'c1':'2', 'c2':'15', 'c3':'a', 'c4':'3/6/2023'},
         {'c1':'2', 'c2':'100', 'c3':'a', 'c4':'2/6/2023'},
         {'c1':'1', 'c2':'20', 'c3':'a', 'c4':'3/6/2023'}])
 
(df3
    .sort_values('c4', key=lambda s: s.astype('datetime64[ns]'))
    .sort_values('c3')
    .sort_values(['c1', 'c2'], key=lambda s: s.astype(int))
)

Upvotes: 0

jezrael

Reputation: 863351

There is problem your data are strings, not numbers, so are sorted alphabetically.

Solution is convert column to int:

df1['c2'] = df1['c2'].astype(int)
print(df1.sort_values(['c1', 'c2']))
  c1   c2
5  1   11
2  1   20
3  2   15
1  2   30
4  2  100
0  3   10

Upvotes: 3

pandas dataframe sort_values on multiple columns did not work all the time

Answers (2)

Related Questions