Light Yagmi

Reputation: 5235

UnicodeDecodeError on json.loads after json.dump with ensure_ascii=False in python 2.x

I dump a dict using json.dumps. To avoid a UnicodeDecodeError, I set ensure_ascii=False, following this advice.

with open(my_file_path, "w") as f:
    f.write(json.dumps(my_dict, ensure_ascii=False))

The dump file is created successfully, but a UnicodeDecodeError occurs when loading it:

with open(my_file_path, "r") as f:
    return json.loads(f.read())

How can I avoid the UnicodeDecodeError when loading the dumped file?

Error message and stacktrace

The error message is UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte and the stacktrace is:

/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    336     if (cls is None and encoding is None and object_hook is None and
    337             parse_int is None and parse_float is None and
--> 338             parse_constant is None and object_pairs_hook is None and not kw):
    339         return _default_decoder.decode(s)
    340     if cls is None:

/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in decode(self, s, _w)
    364         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    365         end = _w(s, end).end()
--> 366         if end != len(s):
    367             raise ValueError(errmsg("Extra data", s, end, len(s)))
    368         return obj

/Users/name/.pyenv/versions/anaconda-2.0.1/python.app/Contents/lib/python2.7/json/decoder.pyc in raw_decode(self, s, idx)
    380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
--> 382             raise ValueError("No JSON object could be decoded")
    383         return obj, end

UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 0: invalid start byte

Upvotes: 3

Views: 9620

Answers (1)

unutbu

Reputation: 879651

In Python2, you could use ensure_ascii=False and decode the result before calling json.loads:

import json

my_dict = {b'\x93': [b'foo', b'\x93', {b'\x93': b'\x93'}]}

dumped = json.dumps(my_dict, ensure_ascii=False)
print(repr(dumped))
# '{"\x93": ["foo", "\x93", {"\x93": "\x93"}]}'
result = json.loads(dumped.decode('cp1252'))
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}

However, note that the result returned by json.loads contains unicode, not strs. So the result is not exactly the same as my_dict.

Note that json.loads always decodes strings to unicode, so if you are interested in faithfully recovering the dict using json.dumps and json.loads, then you need to start with a dict which contains only unicode, no strs.

Moreover, in Python3 json.dumps requires all dict keys to be strings (bytes keys raise a TypeError), and str objects have no decode method. So the above solution does not work in Python3.
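(For comparison: in Python3 every str is already unicode, so the round trip works directly; a minimal sketch:)

```python
import json

# Python3: all strs are unicode, so dumps/loads round-trips directly.
my_dict = {'\u201c': ['foo', '\u201c', {'\u201c': '\u201c'}]}

dumped = json.dumps(my_dict, ensure_ascii=False)
result = json.loads(dumped)
assert result == my_dict
```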


An alternative which works in both Python2 and Python3 is to make sure you pass json.dumps a dict whose keys and values are unicode (or contain no strs). For example, use convert (below) to recursively change the keys and values to unicode before passing them to json.dumps:

import json

def convert(obj, enc):
    # Recursively decode every str found in obj to unicode.
    if isinstance(obj, str):
        return obj.decode(enc)
    if isinstance(obj, (list, tuple)):
        return [convert(item, enc) for item in obj]
    if isinstance(obj, dict):
        return {convert(key, enc): convert(val, enc)
                for key, val in obj.items()}
    return obj

my_dict = {'\x93': ['foo', '\x93', {'\x93': '\x93'}]}
my_dict = convert(my_dict, 'cp1252')

dumped = json.dumps(my_dict)
print(repr(dumped))
# '{"\\u201c": ["foo", "\\u201c", {"\\u201c": "\\u201c"}]}'
result = json.loads(dumped)
print(result)
# {u'\u201c': [u'foo', u'\u201c', {u'\u201c': u'\u201c'}]}
assert result == my_dict

convert will decode all strs found in lists, tuples and dicts inside my_dict.
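A variant of the same recursive idea that also runs on Python3 (it decodes bytes rather than Python2 strs; the name to_text is mine):

```python
import json

def to_text(obj, enc):
    # Recursively decode every bytes object found in obj to text.
    if isinstance(obj, bytes):
        return obj.decode(enc)
    if isinstance(obj, (list, tuple)):
        return [to_text(item, enc) for item in obj]
    if isinstance(obj, dict):
        return {to_text(key, enc): to_text(val, enc)
                for key, val in obj.items()}
    return obj

my_dict = {b'\x93': [b'foo', b'\x93', {b'\x93': b'\x93'}]}
converted = to_text(my_dict, 'cp1252')

# Once everything is text, dumps/loads round-trips faithfully.
assert json.loads(json.dumps(converted)) == converted
```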

Above, I used 'cp1252' as the encoding since (as Fumu pointed out) '\x93' decoded with cp1252 is a LEFT DOUBLE QUOTATION MARK:

In [18]: import unicodedata as UDAT

In [19]: UDAT.name('\x93'.decode('cp1252'))
Out[19]: 'LEFT DOUBLE QUOTATION MARK'
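(The same check in Python3 syntax, where a bytes literal is needed before decode:)

```python
import unicodedata

# b'\x93' decoded as cp1252 is U+201C LEFT DOUBLE QUOTATION MARK.
char = b'\x93'.decode('cp1252')
print(unicodedata.name(char))
# LEFT DOUBLE QUOTATION MARK
```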

If you know the strs in my_dict have been encoded in some other encoding, you should of course call convert using that encoding instead.


Even better, instead of using convert, take care to ensure all strs are decoded to unicode as you are building my_dict.
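Applied back to the loading code in the question: open the file in binary mode and decode the bytes with the encoding they were actually written in before calling json.loads. A sketch that works in both Python2 and Python3 (the cp1252 encoding and the temp path here are assumptions for illustration):

```python
import json
import os
import tempfile

# Simulate the question's file: JSON text written as cp1252-encoded bytes.
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with open(path, 'wb') as f:
    f.write(u'{"\u201c": "foo"}'.encode('cp1252'))

# Read the raw bytes, decode with the right codec, then parse.
with open(path, 'rb') as f:
    result = json.loads(f.read().decode('cp1252'))

assert result == {u'\u201c': u'foo'}
```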

Upvotes: 2
