Reputation: 3208
I'm having some issues with character encoding, specifically with Polish characters.
I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to handle Polish characters. How can I replace these characters?
The é, for example, is a windows-1252 character and must stay as it is. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).
I tried this:
import unicodedata
text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))
This prints:
Racawicka Roge
But now the ó and é are both converted to o and e.
How can I get this right?
Upvotes: 3
Views: 3840
Reputation: 1249
If you are not dealing with large texts, as in your example, you can combine the Unidecode library with the solution provided by jonrsharpe.
from unidecode import unidecode

text = u'Racławicka Rógé'
result = ''
for i in text:
    try:
        # Keep the character if it round-trips through windows-1252
        result += i.encode('1252').decode('1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Otherwise fall back to Unidecode's ASCII transliteration
        result += unidecode(i)
print(result)  # which will be 'Raclawicka Rógé'
Upvotes: 0
Reputation: 122089
If you want to move to 1252, that's what you should tell encode and decode:
>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
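Note that 'ignore' drops the ł instead of replacing it. If you want a replacement without a third-party library, one possible approach is a custom codec error handler, roughly as sketched below. The handler name cp1252_fallback, the FALLBACK table, and the helper to_cp1252_text are made up for this illustration and are not part of the standard library. The table is needed because ł has no Unicode decomposition, so NFKD normalization alone cannot turn it into l; it is an assumption and would have to be extended for other such characters.

```python
import codecs
import unicodedata

# Hypothetical fallback table for characters that NFKD cannot decompose
# ("ł" has no Unicode decomposition). Extend it for your own data.
FALLBACK = {"ł": "l", "Ł": "L"}


def _cp1252_fallback(error):
    """Replace each unencodable character with a windows-1252-safe substitute."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        repl = FALLBACK.get(ch)
        if repl is None:
            # Strip combining marks after decomposition, e.g. "ą" -> "a"
            decomposed = unicodedata.normalize("NFKD", ch)
            repl = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Drop whatever still cannot be represented in windows-1252
        pieces.append(repl.encode("cp1252", "ignore").decode("cp1252"))
    return "".join(pieces), error.end


# Register the handler under an illustrative name
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_cp1252_text(text):
    """Return text with every non-windows-1252 character replaced or dropped."""
    return text.encode("cp1252", "cp1252_fallback").decode("cp1252")


print(to_cp1252_text("Racławicka Rógé"))  # Raclawicka Rógé
```

The handler only runs for characters that cannot be encoded, so é and ó pass through untouched, while characters with neither a table entry nor a usable decomposition are stripped.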
Upvotes: 4