Encoding and Decoding UTF-8 and latin1

Question

I'm studying someone's code for processing data, and got errors on this line:

chars_sst_mangled = ['à', 'á', 'â', 'ã', 'æ', 'ç', 'è', 'é', 'í', 
'í', 'ï', 'ñ', 'ó', 'ô', 'ö', 'û', 'ü']
sentence_fixups = [(char.encode('utf-8').decode('latin1'), char) for char in chars_sst_mangled]

The error message is

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"

I wonder what's the problem here, and how to fix it?

jfs · Accepted Answer

The code is broken.

The specific error indicates that you are trying to run Python 3 code with a python2 executable:

>>> 'à'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

'à' is a bytestring on Python 2, so calling its .encode() method requires decoding the bytestring to Unicode first. That implicit decode uses sys.getdefaultencoding(), which is 'ascii' on Python 2, and that is what triggers the UnicodeDecodeError.
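
The implicit decode can be reproduced by hand on Python 3, where bytes and text are separate types. This is a minimal sketch, assuming the source file was saved as UTF-8 so that Python 2 stores 'à' as the two bytes 0xC3 0xA0; the variable names are only illustrative:

```python
# The two bytes Python 2 would hold for the literal 'à' in a UTF-8 source file
# (assumption: the file is saved as UTF-8).
raw = b'\xc3\xa0'

# Python 2's str.encode() first decodes the bytestring with the default
# 'ascii' codec. Doing that decode explicitly fails in the same way.
message = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding with the encoding the bytes were actually written in succeeds
# and yields the single character U+00E0.
text = raw.decode('utf-8')
print(len(text))
```

The failure comes from the decode step, not from the .encode('utf-8') call the code asked for, which is why the message names the 'ascii' codec.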

The correct way would be to drop the bogus char.encode('utf-8').decode('latin1') conversion and use Unicode literals instead:

Add the correct encoding declaration. For example, if the source file is saved as UTF-8, put # -*- coding: utf-8 -*- at the top so that non-ASCII characters in hardcoded string literals are interpreted correctly.
Also add from __future__ import unicode_literals so that 'à' creates a Unicode string even on Python 2.
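
Put together, the top of the file might look like the sketch below. It assumes the file is saved as UTF-8; text_type is a hypothetical helper name used only to check the result, and the list is copied from the question as-is:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# With unicode_literals, each literal is a text (Unicode) string on both
# Python 2 and Python 3, so no encode/decode round trip is needed.
chars_sst_mangled = ['à', 'á', 'â', 'ã', 'æ', 'ç', 'è', 'é', 'í',
                     'í', 'ï', 'ñ', 'ó', 'ô', 'ö', 'û', 'ü']

# text_type is `unicode` on Python 2 and `str` on Python 3.
text_type = type('')

# Every entry should be a one-character text string, not a multi-byte bytestring.
all_text = all(isinstance(c, text_type) and len(c) == 1
               for c in chars_sst_mangled)
print(all_text)
```

Both lines have to stay at the very top of the file: Python reads the coding declaration only on the first or second line, and __future__ imports must come before any other statement.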
