Reputation: 3605
I use sort_values to sort a dataframe. The dataframe contains UTF-8 characters with accents. Here is an example:
>>> df = pd.DataFrame([['i'], ['e'], ['a'], ['é']])
>>> df.sort_values(by=[0])
0
2 a
1 e
0 i
3 é
As you can see, the "é" with an accent is at the end instead of being after the "e" without accent.
Note that the real dataframe has several columns!
Upvotes: 4
Views: 1881
Reputation: 164773
This is one way. The simplest solution, as suggested by @JonClements:
df = df.iloc[df[0].str.normalize('NFKD').argsort()]
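To see why this works, here is a minimal standard-library sketch of what NFKD does to the sort key. The helper name `nfkd_key` and the two-word list are only illustrative, not part of the original code.

```python
import unicodedata

def nfkd_key(text):
    """Return the NFKD-decomposed form of text, used only as a sort key."""
    return unicodedata.normalize("NFKD", text)

# "é" (U+00E9) decomposes into "e" followed by U+0301 (combining acute accent)
code_points = [hex(ord(ch)) for ch in nfkd_key("\u00e9")]
print(code_points)

# With the decomposed key, "é" sorts directly after plain "e"
letters = ["i", "e", "a", "\u00e9"]
sorted_letters = sorted(letters, key=nfkd_key)
print(sorted_letters)

# Caveat: the combining mark has a higher code point than any ASCII letter,
# so for longer strings "éa" still sorts after "ez"
words = ["\u00e9a", "ez"]
by_nfkd = sorted(words, key=nfkd_key)
print(by_nfkd)
```

So for single letters the order comes out as expected, but for whole words this is not a true dictionary order. Removing the accents entirely, as in the alternative below, avoids that particular problem.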
A more long-winded alternative, with normalization code courtesy of @EdChum:
import pandas as pd

df = pd.DataFrame([['i'], ['e'], ['a'], ['é']])
# remove accents: decompose with NFKD, then drop the non-ASCII combining marks
df[1] = (df[0].str.normalize('NFKD')
              .str.encode('ascii', errors='ignore')
              .str.decode('utf-8'))
# sort by the accent-free helper column, then drop it
# (mergesort is stable, so 'e' and 'é' keep their original relative order)
df = (df.sort_values(1, ascending=True, kind='mergesort')
        .drop(1, axis=1))
print(df)
0
2 a
1 e
3 é
0 i
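Since the real dataframe has several columns, it may be easier to pass a `key` function to `sort_values`, which avoids the helper column. This is only a sketch and assumes pandas 1.1 or later, where `key` was added. The column names `name` and `rank` and the helper `accent_insensitive` are made up for illustration.

```python
import pandas as pd

def accent_insensitive(column):
    """Sort key: NFKD-normalise string columns, leave other dtypes unchanged."""
    if pd.api.types.is_string_dtype(column):
        return column.str.normalize("NFKD")
    return column

# Hypothetical two-column frame; "\u00e9" is "é"
df = pd.DataFrame({
    "name": ["i", "e", "a", "\u00e9", "e"],
    "rank": [1, 2, 3, 4, 1],
})

# The key function is applied to each column listed in `by` separately
result = df.sort_values(by=["name", "rank"], key=accent_insensitive)
print(result)
```

The key only affects the ordering; the values stored in the dataframe keep their accents.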
Upvotes: 6