Reputation: 5006
Python 2.x's built-in json library can encode both unicode strings and UTF-8-encoded (non-ASCII) byte strings, but apparently not at the same time. Try:
import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False)
and it raises a UnicodeDecodeError, whereas both:
json.dumps([u'Ä'], ensure_ascii=False)
and
json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)
...work fine.
Why does JSON-encoding data that contains both unicode strings and UTF-8-encoded (non-ASCII) byte strings raise a UnicodeDecodeError? My Python site encoding is ASCII.
Upvotes: 3
Views: 3563
Reputation: 31260
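It doesn't work because, with ensure_ascii=False, the type of the output string follows the type of the input strings, and for mixed input there is no single type it can produce.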
In my Python 2.7:
>>> json.dumps([u'Ä'], ensure_ascii=False)
u'["\xc4"]'
(a Unicode string)
and
>>> json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)
'["\xc3\x84"]'
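(a UTF-8-encoded byte string)
So if you give it UTF-8-encoded byte strings, it produces a UTF-8-encoded byte string of JSON, and if you give it unicode strings, it produces a unicode string of JSON.
If you mix them, it can't do both: the pieces have to be joined into one string, so Python 2 implicitly decodes the byte string with the site encoding (ASCII), and that fails on the non-ASCII bytes with a UnicodeDecodeError.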
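The decoding failure can be reproduced without json at all. The snippet below is only a sketch: it decodes the UTF-8 bytes of 'Ä' as ASCII explicitly, on the assumption that this is what Python 2 does implicitly when a byte string meets a unicode string. The names utf8_bytes and failed are made up for illustration.

```python
# UTF-8 encoding of u'Ä' (U+00C4) is the two bytes 0xC3 0x84.
utf8_bytes = u"\u00c4".encode("utf-8")

try:
    # Explicit stand-in for the implicit ASCII decode (assumed mechanism).
    utf8_bytes.decode("ascii")
    failed = False
except UnicodeDecodeError:
    # Neither byte is in the ASCII range, so decoding cannot succeed.
    failed = True
```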
To fix this, you can pass an explicit encoding argument (even though UTF-8 is already the default); that seems to make the result always a unicode string:
>>> import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False, encoding="UTF8")
u'["\xc4", "\xc4"]'
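An alternative that does not rely on the encoding argument is to decode every byte string yourself before calling json.dumps, so the encoder only ever sees unicode. This is a rough sketch, not something from the json library: to_text is a hypothetical helper, and it assumes all byte strings in the data are UTF-8.

```python
import json


def to_text(obj, encoding="utf-8"):
    """Hypothetical helper: recursively decode byte strings to unicode."""
    if isinstance(obj, bytes):  # bytes is an alias of str on Python 2
        return obj.decode(encoding)
    if isinstance(obj, (list, tuple)):
        return [to_text(item, encoding) for item in obj]
    if isinstance(obj, dict):
        return dict(
            (to_text(key, encoding), to_text(value, encoding))
            for key, value in obj.items()
        )
    return obj  # numbers, None, unicode strings, etc. pass through


# One unicode string and one UTF-8 byte string, as in the question.
mixed = [u"\u00c4", u"\u00c4".encode("utf-8")]
result = json.dumps(to_text(mixed), ensure_ascii=False)
```

Since the input is then uniformly unicode, the result should be a unicode string regardless of which kinds of strings the original data contained.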
Upvotes: 3