fecke9296
fecke9296

Reputation: 130

Pandas sort_value() issue. Wrong sorting integer when applied key parameter

Hi everyone I have tried to look everywhere for this issue but I cannot figure a solution out. I'd be glad if you'd help me.

So, basically I have this dataset:

df = pd.DataFrame({"col1": ['xxx', 'xxx', 'xxx', 'kkk', 'www', 'www'],
                 "col2": [ 2020, 1994, 2013, 1000, 1996, 2021]})

df.dtypes
col1    object
col2     int64
dtype: object

and I want to order the first column with a costum order and the second column with ascending order. The final result should be the following:

    col1    col2
4    www    1996
5    www    2021
3    kkk    1000
0    xxx    1994
1    xxx    2013
2    xxx    2020

So, in order to accomplish that I do this:

d = {'xxx': 4, 'zzz':1, 'yyy':5, 'kkk':2, 'jjj':3, 'www':0} # to customize order 

df.sort_values(by = ['col1' , 'col2'], key = lambda x: x.map(d))

but I end up with this:

    col1    col2
4    www    1996
5    www    2021
3    kkk    1000
0    xxx    2020
1    xxx    1994
2    xxx    2013

If I only do:

df.sort_values(by = ['col1' , 'col2'])

    col1    col2
3    kkk    1000
4    www    1996
5    www    2021
1    xxx    1994
2    xxx    2013
0    xxx    2020

The col2 is ordered fine. I really don't know why I am having this issue. Has anyone experienced something similar? Thanks

Upvotes: 5

Views: 584

Answers (6)

Victor Ermakov
Victor Ermakov

Reputation: 491

create new column, map it to your dictionary, sort by it, then delete :)

df = pd.DataFrame({"col1": ['xxx', 'xxx', 'xxx', 'kkk', 'www', 'www'],
                 "col2": [ 2020, 1994, 2013, 1000, 1996, 2021]})

d = {'xxx': 4, 'zzz':1, 'yyy':5, 'kkk':2, 'jjj':3, 'www':0} # to customize order

df['sorting'] = df['col1'].map(d)
df = df.sort_values('sorting')
del df['sorting']

Upvotes: 1

jezrael
jezrael

Reputation: 863156

Possible trick is expand dictionary by values from col2:

d = {'xxx': 4, 'zzz':1, 'yyy':5, 'kkk':2, 'jjj':3, 'www':0} # to customize order 
d = {**d, **dict(zip(df.col2, df.col2))}

df = df.sort_values(by = ['col1' , 'col2'], key = lambda x: x.map(d))
print (df)
  col1  col2
4  www  1996
5  www  2021
3  kkk  1000
1  xxx  1994
2  xxx  2013
0  xxx  2020

Or solution with get - if no match is returned same value instead NaN:

df = df.sort_values(by = ['col1' , 'col2'], key = lambda x: x.map(lambda y: d.get(y, y)))
print (df)
  col1  col2
4  www  1996
5  www  2021
3  kkk  1000
1  xxx  1994
2  xxx  2013
0  xxx  2020

Solution with helper column:

d = {'xxx': 4, 'zzz':1, 'yyy':5, 'kkk':2, 'jjj':3, 'www':0} 

df = df.assign(new=df['col1'].map(d)).sort_values(by=['new','col2']).drop('new', axis=1)
print (df)
  col1  col2
4  www  1996
5  www  2021
3  kkk  1000
1  xxx  1994
2  xxx  2013
0  xxx  2020

Upvotes: 1

STerliakov
STerliakov

Reputation: 7913

Well, it's not a bug. What do you expect from key when passing it to multiple columns? It will be applied to each of columns and used as sorting key. You don't have any of col2 elements in d, so it returns None (I think so because no error is raised, thus it may use d.get(x, None)). For your use case the most simple solution will be sorting twice:

df.sort_values(by='col2').sort_values(by='col1', key=lambda x: x.map(d))

I have also checked whether my assumptions were correct. If you set key=x.map(lambda t: d[t]) instead of x.map(d) in your code example, error is raised saying that key is unmatched. So you really try to apply it to both columns and sort by results.

Upvotes: 1

Andreas
Andreas

Reputation: 9207

Pandas has a datatype "categorical" which can do that:

import pandas as pd
df = pd.DataFrame({"col1": ['xxx', 'xxx', 'xxx', 'kkk', 'www', 'www'],
                 "col2": [ 2020, 1994, 2013, 1000, 1996, 2021]})

df["col1"] = pd.Categorical(df["col1"], ordered=True,
             categories=['www', 'zzz', 'kkk', 'jjj', 'xxx', 'yyy'])


df.sort_values(["col1", "col2"], ascending = (True, True))

  col1  col2
4  www  1996
5  www  2021
3  kkk  1000
1  xxx  1994
2  xxx  2013
0  xxx  2020

Upvotes: 1

Rob Raymond
Rob Raymond

Reputation: 31206

Change col1 to be a categorical where it is ordered.

df = pd.DataFrame({"col1": ['xxx', 'xxx', 'xxx', 'kkk', 'www', 'www'],
                 "col2": [ 2020, 1994, 2013, 1000, 1996, 2021]})
d = {'xxx': 4, 'zzz':1, 'yyy':5, 'kkk':2, 'jjj':3, 'www':0} # to customize order 

df["col1"] = pd.Categorical(df["col1"], sorted(list(d.keys()), key=lambda x: d[x]), ordered=True)
df.sort_values(["col1","col2"])
col1 col2
4 www 1996
5 www 2021
3 kkk 1000
1 xxx 1994
2 xxx 2013
0 xxx 2020

Upvotes: 1

Nk03
Nk03

Reputation: 14949

Try this. -

df  =df.groupby('col1')['col2'].apply(sorted).explode().reset_index()
d = {'xxx': 4, 'zzz':1, 'yyy':5, 'kkk':2, 'jjj':3, 'www':0} 
df['sorted'] = df['col1'].map(d)
df = df.sort_values('sorted').drop('sorted', axis=1)

Upvotes: 1

Related Questions