Reputation: 960
I have UTF-8 text that shows up as 'РїРѕРј'... in a Python 3 string. How can I decode it to get the correct string?
From the error messages, it seems I can only decode from a bytes object, but how do I get one from the string? I tried
bytes(str, 'ascii', errors='ignore')
expecting it to leave the existing byte values unchanged, but it removed all the "incorrect" characters (I suppose because their codes are >= 128).
The example string should read as Russian 'пом'...
Upvotes: 0
Views: 193
Reputation: 55600
It looks like you have a string that has been encoded as UTF-8, then decoded as cp1251.
>>> s = 'пом'
>>> s.encode('utf-8').decode('cp1251')
'РїРѕРј'
You can get the original string by reversing the operation.
>>> e = 'РїРѕРј'
>>> e.encode('cp1251').decode('utf-8')
'пом'
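If you need to do this in more than one place, the round trip can be wrapped in a small helper. This is only a sketch: the name fix_mojibake and its parameters are made up for illustration, and it assumes the text really was UTF-8 that got decoded as cp1251. If the round trip fails, it returns the input unchanged rather than guessing.

```python
def fix_mojibake(text, wrong="cp1251", right="utf-8"):
    """Undo a 'decoded with the wrong codec' mistake (hypothetical helper).

    Assumes `text` was produced by decoding `right`-encoded bytes as `wrong`.
    """
    try:
        # Recover the original bytes, then decode them with the intended codec.
        return text.encode(wrong).decode(right)
    except UnicodeError:
        # Either the text has characters that `wrong` cannot represent, or the
        # recovered bytes are not valid `right`; leave the text as it is.
        return text


print(fix_mojibake("РїРѕРј"))
```

Text that was never mangled (plain ASCII, or proper Russian such as 'пом') passes through unchanged, because either the round trip is a no-op or the UTF-8 decode step fails and the original is returned.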
If you want to encode the mojibake string as bytes, without losing information, use the backslashreplace error handler.
>>> e.encode('ascii', errors='backslashreplace')
b'\\u0420\\u0457\\u0420\\u0455\\u0420\\u0458'
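As a rough check that nothing is lost, the escaped bytes can be turned back into the same string with the unicode_escape codec. This is a sketch that assumes the string contains no literal backslashes: backslashreplace does not escape those, so a string containing them may not round-trip cleanly.

```python
e = 'РїРѕРј'

# Non-ASCII characters become \uXXXX escapes; ASCII characters pass through.
escaped = e.encode('ascii', errors='backslashreplace')

# unicode_escape interprets the \uXXXX sequences again.
# Assumption: the original string had no literal backslashes of its own.
restored = escaped.decode('unicode_escape')

print(restored == e)
```

Note that this only gives back the mojibake string; to get the readable text you still need the cp1251/UTF-8 round trip.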
Upvotes: 2