Json decoder not consistent when unicode data present

Question

(This question is related to this one)

Take a look at the following session:

Python 2.7.3 (default, Jan  2 2013, 16:53:07) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import simplejson as json
>>> 
>>> my_json = '''[
...   {
...     "id" : "normal",
...     "txt" : "This is a normal entry"
...   },
...   {
...     "id" : "αβγδ",
...     "txt" : "This is a unicode entry"
...   }
... ]'''
>>> 
>>> cache = json.loads(my_json, encoding='utf-8')
>>> 
>>> cache
[{'txt': 'This is a normal entry', 'id': 'normal'}, {'txt': 'This is a unicode entry', 'id': u'\u03b1\u03b2\u03b3\u03b4'}]

Why is the json decoder producing sometimes unicode, and sometimes plain strings? Isn't it supposed to produce always unicode?

Lycha · Accepted Answer

It seems to be an optimization in simplejson, from simplejson docs:

If s is a str then decoded JSON strings that contain only ASCII characters may be parsed as str for performance and memory reasons. If your code expects only unicode the appropriate solution is decode s to unicode prior to calling decode.

Note: Any any characters included in ASCII are encoded the same in both UTF-8 and ASCII. So ASCII is a subset of UTF-8.

Json decoder not consistent when unicode data present

Answers (1)

Related Questions