testerReg

Reputation: 17

Can't decode an improperly encoded string with à character

I'm trying to decode this:

"LIAISONS Ã  NEW YORK" 

to this:

"LIAISONS à  NEW YORK"

The output of print(ascii(value)) is

'LIAISONS \xc3  NEW YORK'

I tried encoding to cp1252 first and then decoding as UTF-8, but I get this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 9: invalid continuation byte

I also tried encoding to Latin-1/ISO-8859-2, but that does not work either.

How can I do this?

Upvotes: 0

Views: 1987

Answers (1)

Martijn Pieters

Reputation: 1122222

You can't go from your input value to your desired output, because the data is no longer complete.

If your input value were an actual mojibake, that is, UTF-8 bytes decoded as a Latin encoding, then you'd have two characters for the à codepoint, one per UTF-8 byte:

>>> target = "LIAISONS à NEW YORK"
>>> target.encode('UTF-8').decode('latin1')
'LIAISONS Ã\xa0 NEW YORK'

That's because the UTF-8 encoding for à is C3 A0:

>>> 'à'.encode('utf8').hex()
'c3a0'
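As a rough sketch of why the second byte matters, the snippet below reverses the mojibake while both bytes are still present, then repeats the attempt after removing the A0 character. The variable names are illustrative, and dropping the character with `replace` is only an assumption about how the data was damaged.

```python
target = "LIAISONS à NEW YORK"

# Produce the mojibake: UTF-8 bytes misread as Latin-1.
mojibake = target.encode("utf-8").decode("latin1")

# With both bytes intact, the mistake can be reversed exactly.
repaired = mojibake.encode("latin1").decode("utf-8")
print(repaired == target)

# Assumption for illustration: the A0 character was dropped somewhere.
damaged = mojibake.replace("\xa0", "")

error_message = None
try:
    damaged.encode("latin1").decode("utf-8")
except UnicodeDecodeError as exc:
    # The lone C3 byte is now followed by a space, not a continuation byte.
    error_message = str(exc)
    print(error_message)
```

The second attempt fails with the same "invalid continuation byte" error reported in the question.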

In your input, the A0 byte (which doesn't map to a printable character in most Latin-based codecs) has been filtered out somewhere. You can't re-create it from thin air, because the C3 byte of the UTF-8 pair can precede any number of other bytes, all resulting in valid output:

>>> b'\xc3\xa1'.decode('utf8')
'á'
>>> b'\xc3\xa2'.decode('utf8')
'â'
>>> b'\xc3\xa3'.decode('utf8')
'ã'
>>> b'\xc3\xa4'.decode('utf8')
'ä'

and you can't easily pick one of those without additional natural language processing. The bytes 80-A0 and AD are all valid continuation bytes in UTF-8 for this case, but none of them maps to a printable Latin-1 character, so those bytes alone give 34 different possibilities here (33 in the range 80-A0, plus AD).
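To make the ambiguity concrete, here is a small sketch that pairs the surviving C3 byte with each of the non-printable Latin-1 bytes mentioned above and decodes the result. The names `candidates` and `decoded` are illustrative only.

```python
# Bytes with no printable Latin-1 character: 0x80..0xA0 and 0xAD.
candidates = [*range(0x80, 0xA1), 0xAD]

# Each of them is a valid UTF-8 continuation byte after C3.
decoded = {byte: bytes([0xC3, byte]).decode("utf-8") for byte in candidates}

for byte, char in decoded.items():
    print(f"C3 {byte:02X} -> {char!r}")

print(len(decoded), "equally valid candidates")
```

Every pair decodes without error, so nothing in the damaged string itself indicates which character was originally intended.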

Upvotes: 1
