Reputation: 17
I'm trying to encode this:
"LIAISONS Ã NEW YORK"
to this:
"LIAISONS à NEW YORK"
The output of print(ascii(value))
is
'LIAISONS \xc3 NEW YORK'
I tried encoding to cp1252 first and then decoding as UTF-8, but I get this:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I also tried encoding to Latin-1/ISO-8859-2, but that doesn't work either.
How can I do this?
Upvotes: 0
Views: 1987
Reputation: 1122222
You can't go from your input value to your desired output, because the data is no longer complete.
If your input value was an actual Mojibake re-coding from UTF-8 to a Latin encoding, then you'd have two bytes for the à
codepoint:
>>> target = "LIAISONS à NEW YORK"
>>> target.encode('UTF-8').decode('latin1')
'LIAISONS Ã\xa0 NEW YORK'
That's because the UTF-8 encoding for à
is C3 A0:
>>> 'à'.encode('utf8').hex()
'c3a0'
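For comparison, here is a minimal sketch of the usual repair when both bytes are still present, next to what happens with the value from the question. The variable names (`intact`, `damaged`, `repaired`, `error`) are only illustrative, and Latin-1 is assumed as the codec that produced the Mojibake:

```python
# Intact Mojibake: both UTF-8 bytes of 'à' (C3 A0) survived as Latin-1 characters.
intact = "LIAISONS \xc3\xa0 NEW YORK"
repaired = intact.encode("latin1").decode("utf8")
print(repaired)  # LIAISONS à NEW YORK

# The value from the question has lost the A0 byte, so the same round trip fails.
damaged = "LIAISONS \xc3 NEW YORK"
error = None
try:
    damaged.encode("latin1").decode("utf8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The second decode raises the same error as in the question, because the C3 byte is now followed by a plain space instead of a continuation byte.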
In your input, the A0
byte (which doesn't map to a printable character in most Latin-based codecs) has been filtered out somewhere. You can't re-create it from thin air, because the C3
byte of the UTF-8 pair can be followed by any of a number of other continuation bytes, each resulting in a valid character:
>>> b'\xc3\xa1'.decode('utf8')
'á'
>>> b'\xc3\xa2'.decode('utf8')
'â'
>>> b'\xc3\xa3'.decode('utf8')
'ã'
>>> b'\xc3\xa4'.decode('utf8')
'ä'
and you can't easily pick one of those, not without additional natural language processing. The bytes 80-A0 and AD are all valid continuation bytes in UTF-8 for this case, but none of those bytes result in a printable Latin-1 character, so there are 34 different possibilities here (the 33 bytes from 80 through A0, plus AD).
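To make the ambiguity concrete, here is a small sketch that enumerates every character the lost byte could have produced. It assumes the missing byte is one of the bytes named above (80-A0 or AD); the names `dropped_candidates` and `candidates` are just for illustration:

```python
# Latin-1 bytes with no printable glyph: the C1 controls 80-9F,
# the non-breaking space A0, and the soft hyphen AD.
dropped_candidates = list(range(0x80, 0xA1)) + [0xAD]

# Each of them is a valid UTF-8 continuation byte after C3.
candidates = [bytes([0xC3, b]).decode("utf8") for b in dropped_candidates]

print(len(candidates))  # 34
print("".join(candidates))
```

The desired à (from C3 A0) is in that list, but so are À, É, Ö, í and the rest, and nothing in the remaining bytes tells you which one was meant.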
Upvotes: 1