Reputation: 5235
I dump a dict object using json.dumps. To avoid a UnicodeDecodeError, I set ensure_ascii=False following this advice.
with open(my_file_path, "w") as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))
The dump file is created successfully, but loading it raises a UnicodeDecodeError:
with open(my_file_path, "r") as f:
    return json.loads(f.read())
How can I avoid the UnicodeDecodeError when loading the dumped file?
The error message is UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte and the stack trace is:
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
336 if (cls is None and encoding is None and object_hook is None and
337 parse_int is None and parse_float is None and
--> 338 parse_constant is None and object_pairs_hook is None and not kw):
339 return _default_decoder.decode(s)
340 if cls is None:
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
364 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
365 end = _w(s, end).end()
--> 366 if end != len(s):
367 raise ValueError(errmsg("Extra data", s, end, len(s)))
368 return obj
/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
380 obj, end = self.scan_once(s, idx)
381 except StopIteration:
--> 382 raise ValueError("No JSON object could be decoded")
383 return obj, end
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte
Upvotes: 3
Views: 9620
Reputation: 879651
In Python2, you could use ensure_ascii=False and decode the result before calling json.loads:
import json
my_dict = {b'\x93': [b'foo', b'\x93', {b'\x93': b'\x93'}]}
dumped = json.dumps(my_dict, ensure_ascii=False)
print(repr(dumped))
# '{"\x93": ["foo", "\x93", {"\x93": "\x93"}]}'
result = json.loads(dumped.decode('cp1252'))
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
However, note that the result returned by json.loads contains unicode, not strs. So the result is not exactly the same as my_dict.
Note that json.loads always decodes strings to unicode, so if you are interested in faithfully recovering the dict using json.dumps and json.loads, then you need to start with a dict which contains only unicode, no strs.
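In Python 3 this caveat goes away, because every string is already unicode; a minimal sketch showing that a dict built from str keys and values round-trips through json.dumps and json.loads exactly:

```python
import json

# All strings in Python 3 are unicode (str), so a str-only dict
# survives a dumps/loads round trip unchanged.
original = {'\u201c': ['foo', '\u201c', {'\u201c': '\u201c'}]}
restored = json.loads(json.dumps(original, ensure_ascii=False))
assert restored == original
```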
Moreover, in Python3 json.dumps requires all dicts to have keys which are unicode strings. So the above solution does not work in Python3.
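You can see the Python 3 restriction directly: json.dumps raises a TypeError when a dict key is bytes, so byte strings must be decoded before serializing.

```python
import json

# json.dumps in Python 3 rejects dicts whose keys are bytes.
try:
    json.dumps({b'\x93': 'value'})
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```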
An alternative which will work in both Python2 and Python3 is to make sure you pass json.dumps a dict whose keys and values are unicode (or contain no strs). For example, use convert (below) to recursively change the keys and values to unicode before passing them to json.dumps:
import json
def convert(obj, enc):
    if isinstance(obj, str):
        return obj.decode(enc)
    if isinstance(obj, (list, tuple)):
        return [convert(item, enc) for item in obj]
    if isinstance(obj, dict):
        return {convert(key, enc): convert(val, enc)
                for key, val in obj.items()}
    return obj
my_dict = {'\x93': ['foo', '\x93', {'\x93': '\x93'}]}
my_dict = convert(my_dict, 'cp1252')
dumped = json.dumps(my_dict)
print(repr(dumped))
# '{"\\u201c": ["foo", "\\u201c", {"\\u201c": "\\u201c"}]}'
result = json.loads(dumped)
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
assert result == my_dict
convert will decode all strs found in lists, tuples and dicts inside my_dict.
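The same recursive idea carries over to Python 3, where the raw data would be bytes rather than str; a sketch of an equivalent helper (convert3 is an illustrative name, not part of the original answer):

```python
import json

def convert3(obj, enc):
    # Python 3 analogue of convert: recursively decode bytes to str.
    if isinstance(obj, bytes):
        return obj.decode(enc)
    if isinstance(obj, (list, tuple)):
        return [convert3(item, enc) for item in obj]
    if isinstance(obj, dict):
        return {convert3(key, enc): convert3(val, enc)
                for key, val in obj.items()}
    return obj

my_dict = convert3({b'\x93': [b'foo', b'\x93', {b'\x93': b'\x93'}]}, 'cp1252')
# After decoding, the dict serializes and round-trips cleanly.
assert json.loads(json.dumps(my_dict)) == my_dict
```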
Above, I used 'cp1252' as the encoding since (as Fumu pointed out) '\x93' decoded with cp1252 is a LEFT DOUBLE QUOTATION MARK:
In [18]: import unicodedata as UDAT
In [19]: UDAT.name('\x93'.decode('cp1252'))
Out[19]: 'LEFT DOUBLE QUOTATION MARK'
If you know the strs in my_dict have been encoded in some other encoding, you should of course call convert using that encoding instead.
Even better, instead of using convert, take care to ensure all strs are decoded to unicode as you are building my_dict.
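Putting it all together, a hedged sketch of that approach: decode bytes at the point where they enter the dict, and open the file with an explicit encoding when writing and reading (the temporary path here is illustrative):

```python
import io
import json
import os
import tempfile

# Build the dict from already-decoded strings.
my_dict = {b'\x93'.decode('cp1252'): b'foo'.decode('cp1252')}

# Write and read the JSON with an explicit encoding; io.open works
# the same way in Python 2 and Python 3.
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))
with io.open(path, 'r', encoding='utf-8') as f:
    result = json.loads(f.read())
assert result == my_dict
```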
Upvotes: 2