Reputation: 415
I'm trying to clean up some strange Unicode characters left over after my HTML parsing, but my conversion still isn't producing the right text.
Original text:
raw = 'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'
After encoding & decoding:
text = str(raw.encode().decode('unicode_escape'))
Current output:
'If further information is needed, donÃ\x82´t hesitate to contact us. Kind regards, JosÃ\x83© Ramirez'
Desired output:
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez'
Upvotes: 1
Views: 2908
Reputation: 27323
You're doing it the wrong way around. The effect of your raw.encode().decode('unicode_escape')
is the same as raw.encode('utf-8').decode('latin-1')
. What you really want is the reverse:
>>> raw.encode('latin-1').decode('utf-8')
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'
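The equivalence claimed above is easy to check: for input with no backslash escapes, decoding bytes with 'unicode_escape' maps each byte to the Unicode code point of the same number, which is exactly what Latin-1 decoding does. A quick sketch using only the standard library:

```python
raw = 'José'  # any text containing non-ASCII characters, no backslashes

# Both pipelines produce the same (wrong) result:
a = raw.encode().decode('unicode_escape')        # encode() defaults to UTF-8
b = raw.encode('utf-8').decode('latin-1')

print(a)          # 'JosÃ©' — the mojibake form
print(a == b)     # True
```

This is why 'unicode_escape' silently turns correct UTF-8 text into mojibake here rather than fixing anything.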
Your string came from someone taking UTF-8-encoded text and decoding it as if it were Latin-1.
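You can reproduce that corruption step and its inverse in a couple of lines, which makes it clear why encoding back to Latin-1 and decoding as UTF-8 restores the original:

```python
good = 'José'

# Simulate the upstream bug: UTF-8 bytes wrongly decoded as Latin-1.
bad = good.encode('utf-8').decode('latin-1')
print(bad)   # 'JosÃ©'

# Undo it: re-encode as Latin-1 to recover the original UTF-8 bytes,
# then decode those bytes correctly.
fixed = bad.encode('latin-1').decode('utf-8')
print(fixed == good)   # True
```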
If you have to deal with many different variants of mojibake (incorrectly decoded text that results in gibberish), the ftfy
package can help:
>>> import ftfy
>>> ftfy.fix_text('If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.')
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'
Upvotes: 1