Python/JSON: How to Resolve UnicodeDecodeError

Question

I have been trying to learn Python recently by following along with the book Python for Data Analysis, using Python 2.7 with Canopy. The book provides a link to some raw data, which I saved to a file whose location I assigned to a path variable. When I tried to convert the text file to a list of dictionaries using the json module:

records = [json.loads(line) for line in open(path)]

I received the following error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
 in ()
----> 1 records = [json.loads(line) for line in open(path)]

C:\Users\Marc\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\json\__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336             parse_int is None and parse_float is None and
    337             parse_constant is None and object_pairs_hook is None and not kw):
--> 338         return _default_decoder.decode(s)
    339     if cls is None:
    340         cls = JSONDecoder

C:\Users\Marc\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\json\decoder.pyc in decode(self, s, _w)
    363 
    364         """
--> 365         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    366         end = _w(s, end).end()
    367         if end != len(s):

C:\Users\Marc\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\json\decoder.pyc in raw_decode(self, s, idx)
    379         """
    380         try:
--> 381             obj, end = self.scan_once(s, idx)
    382         except StopIteration:
    383             raise ValueError("No JSON object could be decoded")

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 6: invalid start byte

The weird thing is that this worked on a different computer, which I thought was using the same version of Python. Thanks in advance.

Martijn Pieters · Accepted Answer

The data in question contains one U+2019 RIGHT SINGLE QUOTATION MARK character, encoded to UTF-8. But you used copy-and-paste to save the data rather than save the text straight to disk.

In doing so, somewhere along the way the data was decoded, then encoded again, to Windows Codepage 1252:

>>> u'\u2019'.encode('cp1252')
'\x92'

In other words, your data file is not byte-for-byte the same as the original. It probably contains the same text, but stored in a different encoding.
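
To make the mismatch concrete, here is a small sketch comparing the two encodings of that one character. The helper name decodes_as_utf8 is made up for illustration; the point is only that the single CP-1252 byte 0x92 is not a valid way to start a UTF-8 sequence, which matches the "invalid start byte" message in the traceback.

```python
quote = u'\u2019'  # RIGHT SINGLE QUOTATION MARK

# The same character, encoded two different ways.
utf8_bytes = quote.encode('utf-8')      # three bytes: E2 80 99
cp1252_bytes = quote.encode('cp1252')   # one byte: 92


def decodes_as_utf8(data):
    """Hypothetical helper: report whether a byte string is valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False
```

Calling decodes_as_utf8(utf8_bytes) succeeds, while decodes_as_utf8(cp1252_bytes) does not, which is the situation your copied file is in.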

The JSON standard states that data must be encoded as UTF-8, UTF-16 or UTF-32, with UTF-8 being the default, and UTF-8 is what the Python json module assumes if you don't give it an encoding. Because you are feeding it CP-1252 data instead, the decoding fails.
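
The cleanest fix is probably to download the original file again and save it straight to disk, so it stays UTF-8. If you would rather keep the copy you have, one possible workaround is to decode it as CP-1252 yourself before handing each line to json.loads. The sketch below assumes the whole file really is CP-1252 and holds one JSON object per line; load_records is a made-up name, and skipping blank lines is my addition, not something your original code did.

```python
import io
import json


def load_records(path, encoding='cp1252'):
    """Hypothetical helper: parse one JSON object per line of a text file.

    io.open decodes the file using the given encoding (assumed here to be
    CP-1252), so json.loads receives unicode text rather than raw bytes.
    """
    with io.open(path, encoding=encoding) as f:
        # Blank lines are skipped so a trailing newline does not break parsing.
        return [json.loads(line) for line in f if line.strip()]
```

With this in place, records = load_records(path) would replace the list comprehension from the question. If the file turns out to be UTF-8 after all, passing encoding='utf-8' reads it as such.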
