Reputation: 21
I'm having this 2 pandas dataframes, I tried to sort them the same way, but getting different result
import pandas as pd
df = pd.DataFrame([{'a':'4','b':'23'},{'a':'3', 'b':'10'},{'a':'2','b':'88'},{'a':'2', 'b':'29'},{'a':'2','b':'12'},{'a':'1','b':'284'}, {'a':'1','b':'24'}, {'a':'3','b':'1'}])
print(df.sort_values(['a', 'b']))
df1 = pd.DataFrame([{'c1':'3','c2':'10'},{'c1':'2', 'c2':'30'},{'c1':'1','c2':'20'},{'c1':'2','c2':'15'},{'c1':'2','c2':'100'},{'c1':'1','c2':'11'}])
print(df1.sort_values(['c1', 'c2']))
the first one returns, which is expected
a b
6 1 24
5 1 284
4 2 12
3 2 29
2 2 88
7 3 1
1 3 10
0 4 23
the 2nd one returns
c1 c2
5 1 11
2 1 20
4 2 100
3 2 15
1 2 30
0 3 10
I would think it would return
c1 c2
5 1 11
2 1 20
4 2 15
3 2 30
1 2 100
0 3 10
can someone tell me why they behave differently? Thanks.
Upvotes: 2
Views: 3784
Reputation: 839
Sounds like you want to sort a dataframe with string
columns. You want to sort the values as if they were ints but want to keep the datatype of the column as string
for some reason.
The best way to keep the datatype intact but sort as integer is to use the key
parameter in sort_values()
:
df1 = pd.DataFrame(
[{'c1':'3', 'c2':'10'},
{'c1':'2', 'c2':'30'},
{'c1':'1', 'c2':'20'},
{'c1':'2', 'c2':'15'},
{'c1':'2', 'c2':'100'},
{'c1':'1', 'c2':'11'}])
df1.sort_values(['c1', 'c2'], key=lambda s: s.astype(int))
The key
parameter is applied to every column you're sorting by, so if you have an additional column that you also want to sort by but cannot convert to int, it will fail. The astype()
function does let you handle that case:
df2 = pd.DataFrame(
[{'c1':'3', 'c2':'10', 'c3':'a'},
{'c1':'2', 'c2':'15', 'c3':'a'},
{'c1':'1', 'c2':'20', 'c3':'b'},
{'c1':'2', 'c2':'15', 'c3':'b'},
{'c1':'2', 'c2':'100', 'c3':'a'},
{'c1':'1', 'c2':'20', 'c3':'a'}])
(df2
.sort_values(['c1', 'c2'], key=lambda s: s.astype(int, errors='ignore'))
)
More generally, if you have multiple columns with different conversions, then you have to call sort_values
multiple times:
df3 = pd.DataFrame(
[{'c1':'3', 'c2':'10', 'c3':'a', 'c4':'1/6/2023'},
{'c1':'2', 'c2':'15', 'c3':'b', 'c4':'2/6/2023'},
{'c1':'1', 'c2':'20', 'c3':'b', 'c4':'1/6/2023'},
{'c1':'2', 'c2':'15', 'c3':'a', 'c4':'4/6/2023'},
{'c1':'2', 'c2':'15', 'c3':'a', 'c4':'3/6/2023'},
{'c1':'2', 'c2':'100', 'c3':'a', 'c4':'2/6/2023'},
{'c1':'1', 'c2':'20', 'c3':'a', 'c4':'3/6/2023'}])
(df3
.sort_values('c4', key=lambda s: s.astype('datetime64[ns]'))
.sort_values('c3')
.sort_values(['c1', 'c2'], key=lambda s: s.astype(int))
)
Upvotes: 0
Reputation: 863351
There is problem your data are strings, not numbers, so are sorted alphabetically.
Solution is convert column to int
:
df1['c2'] = df1['c2'].astype(int)
print(df1.sort_values(['c1', 'c2']))
c1 c2
5 1 11
2 1 20
3 2 15
1 2 30
4 2 100
0 3 10
Upvotes: 3