HFPSY

Reputation: 41

Sort a pandas DataFrame that contains German umlauts [Solution]

I have a pandas dataframe with several series and I sort this dataframe by some of them:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

This works fine! But German umlauts are not in the "German order": Ä is not placed between A and B but is sorted after Z. Also, "denmark" is sorted after "Zimmermann", because the pandas sorting algorithm seems to be case-sensitive.

I found solutions for sorting a DataFrame by one Series (e.g. here) but none for sorting by several Series. Umlauts can occur in every Series. So I experimented a bit; maybe this helps someone. :-)


Solution:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))

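str.lower() solves the problem that the string "denmark" is sorted after "Zimmermann": the comparison is made on lowercased values, so case no longer matters.

str.normalize('NFD') solves the problem with the German umlauts: it decomposes a character such as "ä" into the base letter "a" plus a combining diaeresis, so "Ä" words land between the "A" and "B" words rather than after "Z".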
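
To see the effect, here is a small sketch comparing the default sort with the key-based sort. The sample data and the helper name german_key are made up for illustration; only two of the columns are used to keep it short.

```python
import pandas as pd

# Made-up sample data; both columns contain strings with umlauts.
df = pd.DataFrame({
    "Col1": ["Zimmermann", "denmark", "Ärger", "Apfel", "Bach", "Apfel"],
    "Col2": ["x", "y", "z", "Öl", "w", "Ofen"],
})

def german_key(col):
    """Sort key: lowercase first, then decompose umlauts (NFD)."""
    return col.str.lower().str.normalize("NFD")

# Default sort compares raw code points: uppercase before lowercase,
# and "Ä" after every ASCII letter.
default_sorted = df.sort_values(by=["Col1", "Col2"])

# The key function is applied to each column listed in `by`.
german_sorted = df.sort_values(by=["Col1", "Col2"], key=german_key)

print(default_sorted)
print(german_sorted)
```

With the key, Col1 comes out as Apfel, Apfel, Ärger, Bach, denmark, Zimmermann, and the two "Apfel" rows are ordered Ofen, Öl in Col2. Note that this places all "Ä" words after all plain "A" words, which is close to, but not necessarily identical with, a dictionary-style German collation. The key argument of sort_values needs pandas 1.1 or newer.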

Upvotes: 1

Views: 316

Answers (1)

HFPSY

Reputation: 41

Solution as an answer, so I can accept it as solved:

df = df.sort_values(by=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'], key=lambda col: col.str.lower().str.normalize('NFD'))

str.lower() solves the problem that the string "denmark" is sorted after "Zimmermann".

str.normalize('NFD') solves the problem with the German umlauts.
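
One caveat: the key function is called once for every column in by, so the lambda above assumes that all of them hold strings. If some sort columns are numeric, a guarded key could look like the sketch below. The helper name text_aware_key and the sample data are hypothetical, and the check relies on pandas.api.types.is_string_dtype, which may not report True for a text column that contains missing values.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def text_aware_key(col):
    """Normalize string columns; pass any other column through unchanged."""
    if is_string_dtype(col):
        return col.str.lower().str.normalize("NFD")
    return col

# Made-up data: one text column and one integer column.
people = pd.DataFrame({
    "Name": ["Zimmermann", "ärger", "Apfel", "Ärger"],
    "Age": [30, 41, 25, 19],
})

# "ärger" and "Ärger" get the same key, so Age breaks the tie.
people_sorted = people.sort_values(by=["Name", "Age"], key=text_aware_key)
print(people_sorted)
```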

Upvotes: 1
