Reputation: 41
I have a pandas dataframe with several series and I sort this dataframe by some of them:
df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])
This works fine! But German umlauts are not in the "German order": Ä is not sorted between A and B, but after Z. Also, "denmark" is sorted after "Zimmermann", because the pandas sorting algorithm seems to be case-sensitive.
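A minimal sketch that reproduces the behaviour described above. The sample names and the single column are made up for illustration; only `sort_values` comes from the question.

```python
import pandas as pd

# Hypothetical sample data; "\u00c4rger" is "Ärger" with Ä as one precomposed code point.
df_demo = pd.DataFrame(
    {"Col1": ["Zimmermann", "\u00c4rger", "Bauer", "denmark", "Adler"]}
)

# Default sort: strings are compared by code point, so uppercase letters
# come before lowercase ones and Ä (U+00C4) comes after every ASCII letter.
default_order = df_demo.sort_values(by=["Col1"])["Col1"].tolist()
print(default_order)
# ['Adler', 'Bauer', 'Zimmermann', 'denmark', 'Ärger']
```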
I found solutions for sorting a dataframe by one Series (e.g. here) but no solution for sorting by several Series. Umlauts are possible in all of them. So I experimented a bit - maybe this helps someone. :-)
Solution:
df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))
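str.lower()
solves the problem that the string "denmark" is sorted after "Zimmermann", because all values are compared in lowercase.
str.normalize('NFD')
solves the problem with the German umlauts: each umlaut is decomposed into its base letter plus a combining mark, so Ä is sorted between A and B instead of after Z.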
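To see the effect on more than one column, here is a small sketch with made-up data (two hypothetical columns instead of five). It assumes pandas 1.1 or newer, where `sort_values` accepts a `key` argument; the key function is applied to each sort column separately.

```python
import pandas as pd

# Hypothetical sample data; \u00c4 is Ä and \u00d6 is Ö (precomposed code points).
df_demo = pd.DataFrame(
    {
        "Col1": ["Zimmermann", "\u00c4rger", "denmark", "Adler", "Adler"],
        "Col2": ["a", "b", "c", "\u00d6l", "Ober"],
    }
)

# Same key as in the solution: lowercase first, then decompose umlauts (NFD).
sorted_demo = df_demo.sort_values(
    by=["Col1", "Col2"],
    key=lambda col: col.str.lower().str.normalize("NFD"),
)

col1_order = sorted_demo["Col1"].tolist()
col2_order = sorted_demo["Col2"].tolist()
print(col1_order)  # ['Adler', 'Adler', 'Ärger', 'denmark', 'Zimmermann']
print(col2_order)  # ['Ober', 'Öl', 'b', 'c', 'a']
```

The key only affects the comparison; the values stored in the dataframe keep their original spelling. This is still a comparison by code point, not a full German collation: "Ärger" lands after every word that starts with "a" followed by an ASCII letter, but before "b".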
Upvotes: 1
Views: 316
Reputation: 41
Posting the solution as an answer, so I can mark the question as solved:
df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))
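str.lower()
solves the problem that the string "denmark" is sorted after "Zimmermann", because all values are compared in lowercase.
str.normalize('NFD')
solves the problem with the German umlauts: each umlaut is decomposed into its base letter plus a combining mark, so Ä is sorted between A and B instead of after Z.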
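Why NFD helps can be checked with the standard library alone. This is only an illustration using `unicodedata.normalize`, which performs the same kind of Unicode normalization as `str.normalize('NFD')` in pandas; the variable names are made up.

```python
import unicodedata

composed = "\u00c4"  # "Ä" as a single precomposed code point
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits Ä into "A" (U+0041) plus a combining diaeresis (U+0308).
code_points = [hex(ord(ch)) for ch in decomposed]
print(code_points)  # ['0x41', '0x308']

# The precomposed form sorts after "Z"; the decomposed form starts with "A",
# so it falls between "A" and "B".
sorts_after_z = composed > "Z"
sorts_between_a_and_b = "A" < decomposed < "B"
print(sorts_after_z, sorts_between_a_and_b)  # True True
```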
Upvotes: 1