Reputation: 312
I have been learning Python recently, following along with the book Python for Data Analysis and using Python 2.7 with Canopy. The book provides a link to some raw data, which I saved to disk and assigned to a path variable. After trying to convert the text file to a list of dictionaries using JSON:
records = [json.loads(line) for line in open(path)]
I received the following error:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-17-b1e0b494454a> in <module>()
----> 1 records = [json.loads(line) for line in open(path)]
C:\Users\Marc\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\json\__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 parse_int is None and parse_float is None and
337 parse_constant is None and object_pairs_hook is None and not kw):
--> 338 return _default_decoder.decode(s)
339 if cls is None:
340 cls = JSONDecoder
C:\Users\Marc\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\json\decoder.pyc in decode(self, s, _w)
363
364 """
--> 365 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
366 end = _w(s, end).end()
367 if end != len(s):
C:\Users\Marc\AppData\Local\Enthought\Canopy\App\appdata\canopy-1.4.1.1975.win-x86_64\lib\json\decoder.pyc in raw_decode(self, s, idx)
379 """
380 try:
--> 381 obj, end = self.scan_once(s, idx)
382 except StopIteration:
383 raise ValueError("No JSON object could be decoded")
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 6: invalid start byte
The weird thing is that this worked on a different computer, which I thought was using the same version of Python. Thanks in advance.
Upvotes: 1
Views: 1129
Reputation: 1121266
The data in question contains one U+2019 RIGHT SINGLE QUOTATION MARK character, encoded as UTF-8 in the original file. But you used copy-and-paste to save the data rather than saving the file straight to disk.
In doing so, somewhere along the way the data was decoded, then encoded again, this time to Windows Codepage 1252:
>>> u'\u2019'.encode('cp1252')
'\x92'
In other words, your data file is not identical to the original. It probably contains the same text, but stored in a different encoding.
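To see why the byte in the error message points at CP-1252, here is a small check you can run yourself. It is only a sketch; the variable names are mine. It compares how U+2019 is encoded in each codec and confirms that the lone CP-1252 byte cannot be read as UTF-8:

```python
right_quote = u'\u2019'  # RIGHT SINGLE QUOTATION MARK

# The same character produces different bytes depending on the codec.
utf8_bytes = right_quote.encode('utf-8')     # three bytes: E2 80 99
cp1252_bytes = right_quote.encode('cp1252')  # one byte: 92

# 0x92 on its own is a UTF-8 continuation byte, so it cannot start a
# character; decoding it as UTF-8 raises UnicodeDecodeError.
try:
    cp1252_bytes.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# Decoding with the codec that was actually used recovers the character.
round_trip = cp1252_bytes.decode('cp1252')
```

If your saved file contains the byte 0x92 where the original has E2 80 99, that is consistent with the file having been re-encoded on the way to disk.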
The JSON standard states that data needs to be encoded as UTF-8, UTF-16 or UTF-32, with UTF-8 being the default, and UTF-8 is what the Python json module will use if you don't give it an encoding. Because you are feeding it CP-1252 data instead, the decoding fails.
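If re-downloading the file straight to disk is not an option, one possible workaround is to decode the file as CP-1252 yourself before handing each line to json.loads. The following is a minimal sketch, assuming the file really is CP-1252 throughout; load_records and the temporary sample file are hypothetical names made up for the illustration, not anything from the book:

```python
import io
import json
import os
import tempfile


def load_records(path, encoding='cp1252'):
    """Read one JSON object per line, decoding the file with `encoding`."""
    # io.open decodes on read in both Python 2 and Python 3, so json.loads
    # receives text rather than raw CP-1252 bytes.
    with io.open(path, encoding=encoding) as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical sample standing in for the saved data file: one JSON line
# containing U+2019, written out as CP-1252 (so the byte on disk is 0x92).
sample = u'{"note": "it\u2019s"}\n'
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('cp1252'))

records = load_records(demo_path)
os.remove(demo_path)
```

With your own file you would call load_records(path) instead of building the sample. On Python 2.7, passing encoding='cp1252' to json.loads (the parameter is visible in the signature in your traceback) should have a similar effect, but that parameter no longer exists in recent Python 3 releases, so decoding at the file level is the more portable choice.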
Upvotes: 2