Python Polish character encoding issues

Question

I'm having some issues with character encoding, specifically with Polish characters.

I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to work with Polish characters. How can I replace these characters?

The é, for example, is a windows-1252 character and must stay as it is. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).

I tried this:

import unicodedata

text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))

This prints:

Racawicka Roge

But now the ó and é are also converted to o and e, even though both are valid windows-1252 characters.

How can I get this right?

jonrsharpe · Accepted Answer

If you want to move to 1252, that's what you should tell encode and decode:

>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
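
Note that 'ignore' simply drops anything windows-1252 cannot represent, so the ł disappears rather than becoming l. If you want an equivalent where one exists, one possible approach is to go character by character and only fall back when encoding fails. The sketch below is an illustration, not a complete solution: the function name to_cp1252 and the FALLBACKS table are made up for this example, and the table only covers ł/Ł, which have no Unicode decomposition and so need an explicit mapping.

```python
import unicodedata

# Hypothetical fallback table for letters that NFKD cannot decompose.
# Extend it with whatever replacements make sense for your data.
FALLBACKS = {"ł": "l", "Ł": "L"}


def to_cp1252(text):
    """Return text restricted to characters that windows-1252 can encode."""
    # Compose first so that e + combining accent becomes a single é.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            pass
        else:
            out.append(ch)  # already a windows-1252 character, keep it
            continue
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # Decompose (ą -> a + ogonek) and drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Anything still not encodable is stripped.
        out.append(base.encode("cp1252", "ignore").decode("cp1252"))
    return "".join(out)


print(to_cp1252("Racławicka Rógé"))  # Raclawicka Rógé
```

Here ó and é are left alone because they encode cleanly, while ł goes through the table and letters such as ą or ż lose only their diacritic.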
