Timo002

Reputation: 3208

Python Polish character encoding issues

I'm having trouble with character encoding, specifically with Polish characters.

I need to replace every non-windows-1252 character with a windows-1252 equivalent. This worked until I had to handle Polish characters. How can I replace these characters?

For example, é is a windows-1252 character and must stay as it is. But ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).

I tried this:

import unicodedata

text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))

This prints:

Racawicka Roge

But now ó and é have been converted to o and e, and the ł has been dropped entirely.

How can I get this right?

Upvotes: 3

Views: 3840

Answers (2)

Fabio Menegazzo

Reputation: 1249

If you are not dealing with large texts, as in your example, you can combine the Unidecode library with the solution provided by jonrsharpe.

from unidecode import unidecode

text = u'Racławicka Rógé'
result = ''

for i in text:
    try:
        result += i.encode('1252').decode('1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        result += unidecode(i)

print(result)  # prints: Raclawicka Rógé
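
If you would rather not add a dependency, the same per-character fallback can be sketched as a codec error handler using only the standard library. This is a rough sketch, not a full transliteration scheme: it assumes a small hand-made table for letters such as ł, which have no Unicode decomposition, and falls back to NFKD for everything else. The names FALLBACKS, cp1252_fallback and to_cp1252_text are made up for illustration.

```python
import codecs
import unicodedata

# Hypothetical table for letters that NFKD cannot decompose; extend as needed.
FALLBACKS = {"ł": "l", "Ł": "L"}


def cp1252_fallback(exc):
    """Error handler: substitute a close cp1252 equivalent, or drop the character."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = []
    # exc.start:exc.end may cover a run of several unencodable characters.
    for ch in exc.object[exc.start:exc.end]:
        if ch in FALLBACKS:
            out.append(FALLBACKS[ch])
            continue
        # Split the character into base letter + combining marks, then keep
        # only the parts that cp1252 can represent.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append(decomposed.encode("cp1252", "ignore").decode("cp1252"))
    return "".join(out), exc.end


codecs.register_error("cp1252_fallback", cp1252_fallback)


def to_cp1252_text(text):
    """Return text with every character representable in cp1252."""
    return text.encode("cp1252", "cp1252_fallback").decode("cp1252")


print(to_cp1252_text("Racławicka Rógé"))
```

Because the handler only runs for characters that cp1252 rejects, é and ó pass through untouched, and only the characters that actually fail pay the cost of the fallback.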

Upvotes: 0

jonrsharpe

Reputation: 122089

If you want to move to 1252, that's what you should tell encode and decode:

>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
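
As a small sketch of what the second argument controls, here is the same string run through three of the built-in error handlers. 'ignore' drops the ł, 'replace' substitutes a question mark, and 'xmlcharrefreplace' writes a numeric character reference. The variable names are only illustrative.

```python
text = "Racławicka Rógé"

# Each handler decides what happens to characters cp1252 cannot represent.
dropped = text.encode("cp1252", "ignore").decode("cp1252")
marked = text.encode("cp1252", "replace").decode("cp1252")
referenced = text.encode("cp1252", "xmlcharrefreplace").decode("cp1252")

print(dropped)     # Racawicka Rógé
print(marked)      # Rac?awicka Rógé
print(referenced)  # Rac&#322;awicka Rógé
```

None of these produce an equivalent letter for ł; they only differ in how the loss is reported.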

Upvotes: 4
