javier
javier

Reputation: 21

pandas dataframe sort_values on multiple columns did not work all the time

I'm having this 2 pandas dataframes, I tried to sort them the same way, but getting different result

import pandas as pd

df = pd.DataFrame([{'a':'4','b':'23'},{'a':'3', 'b':'10'},{'a':'2','b':'88'},{'a':'2', 'b':'29'},{'a':'2','b':'12'},{'a':'1','b':'284'}, {'a':'1','b':'24'}, {'a':'3','b':'1'}])

print(df.sort_values(['a', 'b']))

df1 = pd.DataFrame([{'c1':'3','c2':'10'},{'c1':'2', 'c2':'30'},{'c1':'1','c2':'20'},{'c1':'2','c2':'15'},{'c1':'2','c2':'100'},{'c1':'1','c2':'11'}])

print(df1.sort_values(['c1', 'c2']))

the first one returns, which is expected

   a    b
6  1   24
5  1  284
4  2   12
3  2   29
2  2   88
7  3    1
1  3   10
0  4   23

the 2nd one returns

   c1   c2
5  1   11
2  1   20
4  2  100
3  2   15
1  2   30
0  3   10

I would think it would return

  c1   c2
5  1   11
2  1   20
4  2   15
3  2   30
1  2  100
0  3   10

can someone tell me why they behave differently? Thanks.

Upvotes: 2

Views: 3784

Answers (2)

jalopezp
jalopezp

Reputation: 839

Sounds like you want to sort a dataframe with string columns. You want to sort the values as if they were ints but want to keep the datatype of the column as string for some reason.

The best way to keep the datatype intact but sort as integer is to use the key parameter in sort_values():

df1 = pd.DataFrame(
        [{'c1':'3', 'c2':'10'},
         {'c1':'2', 'c2':'30'},
         {'c1':'1', 'c2':'20'},
         {'c1':'2', 'c2':'15'},
         {'c1':'2', 'c2':'100'},
         {'c1':'1', 'c2':'11'}])


df1.sort_values(['c1', 'c2'], key=lambda s: s.astype(int))

The key parameter is applied to every column you're sorting by, so if you have an additional column that you also want to sort by but cannot convert to int, it will fail. The astype() function does let you handle that case:

df2 = pd.DataFrame(
        [{'c1':'3', 'c2':'10', 'c3':'a'},
         {'c1':'2', 'c2':'15', 'c3':'a'},
         {'c1':'1', 'c2':'20', 'c3':'b'},
         {'c1':'2', 'c2':'15', 'c3':'b'},
         {'c1':'2', 'c2':'100', 'c3':'a'},
         {'c1':'1', 'c2':'20', 'c3':'a'}])

(df2
    .sort_values(['c1', 'c2'], key=lambda s: s.astype(int, errors='ignore'))
)

More generally, if you have multiple columns with different conversions, then you have to call sort_values multiple times:

df3 = pd.DataFrame(
        [{'c1':'3', 'c2':'10', 'c3':'a', 'c4':'1/6/2023'},
         {'c1':'2', 'c2':'15', 'c3':'b', 'c4':'2/6/2023'},
         {'c1':'1', 'c2':'20', 'c3':'b', 'c4':'1/6/2023'},
         {'c1':'2', 'c2':'15', 'c3':'a', 'c4':'4/6/2023'},
         {'c1':'2', 'c2':'15', 'c3':'a', 'c4':'3/6/2023'},
         {'c1':'2', 'c2':'100', 'c3':'a', 'c4':'2/6/2023'},
         {'c1':'1', 'c2':'20', 'c3':'a', 'c4':'3/6/2023'}])
 
(df3
    .sort_values('c4', key=lambda s: s.astype('datetime64[ns]'))
    .sort_values('c3')
    .sort_values(['c1', 'c2'], key=lambda s: s.astype(int))
)

Upvotes: 0

jezrael
jezrael

Reputation: 863351

There is problem your data are strings, not numbers, so are sorted alphabetically.

Solution is convert column to int:

df1['c2'] = df1['c2'].astype(int)
print(df1.sort_values(['c1', 'c2']))
  c1   c2
5  1   11
2  1   20
3  2   15
1  2   30
4  2  100
0  3   10

Upvotes: 3

Related Questions