Lucas Mengual

Reputation: 415

Encoding and Decoding special characters (Latin-1)

I'm trying to clean up some strange Unicode characters left over after HTML parsing, but my attempt is still not converting them.

Original text:

raw = 'If further information is needed, donÂ´t hesitate to contact us. Kind regards, JosÃ© Ramirez.'

After encoding & decoding:

text = str(raw.encode().decode('unicode_escape'))

Current output:

'If further information is needed, donÃ\x82Â´t hesitate to contact us. Kind regards, JosÃ\x83Â© Ramirez'

Desired output:

'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez'

Upvotes: 1

Views: 2908

Answers (1)

L3viathan

Reputation: 27323

You're doing it the wrong way around. For text without backslashes, your raw.encode().decode('unicode_escape') has the same effect as raw.encode('utf-8').decode('latin-1'). What you really want:

>>> raw.encode('latin-1').decode('utf-8')
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'

Your string came from someone taking UTF-8 encoded text and decoding it as if it were Latin-1.
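
To see how this kind of damage arises and why swapping the codecs undoes it, here is a minimal sketch using only the standard library. The sample string and variable names are made up for illustration:

```python
clean = 'José'

# Someone encodes the text as UTF-8 but reads the bytes back as Latin-1.
mojibake = clean.encode('utf-8').decode('latin-1')

# For text without backslashes, 'unicode_escape' decodes bytes like Latin-1,
# so the approach from the question adds a layer of damage instead of removing one.
same = clean.encode().decode('unicode_escape')

# Reversing the two steps recovers the original text.
restored = mojibake.encode('latin-1').decode('utf-8')

print(repr(mojibake), mojibake == same, repr(restored))
```

Each UTF-8 byte of a non-ASCII character becomes its own Latin-1 character, which is why one accented letter turns into two.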

If you have many different variants of Mojibake (the incorrect decoding of text, resulting in gibberish), the ftfy package can help:

>>> import ftfy
>>> ftfy.fix_text('If further information is needed, donÂ´t hesitate to contact us. Kind regards, JosÃ© Ramirez.')
'If further information is needed, don´t hesitate to contact us. Kind regards, José Ramirez.'
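
If installing a package is not an option, text that has been through the UTF-8-read-as-Latin-1 mistake more than once can be unwound with a small loop. This is only a rough sketch, not how ftfy works: `undo_latin1_mojibake` and `max_rounds` are hypothetical names, and it handles only this one variant of Mojibake.

```python
def undo_latin1_mojibake(text, max_rounds=5):
    """Hypothetical helper: peel off layers of UTF-8 decoded as Latin-1."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not (or no longer) this kind of Mojibake
        if candidate == text:
            break  # pure ASCII, nothing left to change
        text = candidate
    return text


# Build a doubly damaged sample by applying the mistake twice.
damaged = 'José'
for _ in range(2):
    damaged = damaged.encode('utf-8').decode('latin-1')

print(undo_latin1_mojibake(damaged))
```

The loop stops as soon as the round trip fails, so text that is already clean and contains accented characters is left alone. A string whose characters happen to form valid UTF-8 bytes on purpose would still be altered, so treat it as a heuristic.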

Upvotes: 1
