Petri

Reputation: 5006

Why does Python json.dumps fail on mixed utf-8 & unicode strings?

The built-in json library in Python 2.x supports encoding both unicode strings and UTF-8 encoded (non-ASCII) byte strings, but apparently not both at the same time. Try:

import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False)

and see it raise a UnicodeDecodeError. Whereas both:

json.dumps([u'Ä'], ensure_ascii=False)

and

json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)

...work ok.

Why does JSON encoding of data containing both unicode and UTF-8 encoded (non-ASCII) strings raise a UnicodeDecodeError? My Python site encoding is ASCII.

Upvotes: 3

Views: 3563

Answers (1)

RemcoGerlich

Reputation: 31260

It fails because json.dumps has to return a single output string, and with mixed input it can't decide whether that should be a unicode string or a byte string.

In my Python 2.7:

>>> json.dumps([u'Ä'], ensure_ascii=False)
u'["\xc4"]'

(a Unicode string)

and

>>> json.dumps([u'Ä'.encode("utf-8")], ensure_ascii=False)
'["\xc3\x84"]'

(a UTF-8 encoded byte string)

So if you give it UTF-8 encoded byte strings, it produces the JSON as a UTF-8 encoded byte string, and if you give it unicode strings, it produces the JSON as a unicode string.

If you mix them, it can't do both.
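
The error itself can be reproduced without json. In Python 2, combining a unicode string with a byte string makes Python decode the byte string implicitly with the default (site) encoding, which is ASCII in your case. Presumably that is what happens when json.dumps joins its output pieces. A minimal sketch that performs the same decode explicitly (the variable names are only for illustration):

```python
# The UTF-8 encoding of u'\xc4' (the letter A with diaeresis) is two non-ASCII bytes.
raw = u'\xc4'.encode("utf-8")

try:
    # Python 2 does this decode implicitly when a unicode string and a
    # byte string are joined and the default encoding is ASCII.
    raw.decode("ascii")
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

print(error_name)
```

Neither of the two bytes is valid ASCII, so the decode fails on the first one.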

To fix this, you can pass an explicit encoding argument (even though the default is already UTF-8); it seems the result is then always a unicode string:

>>> import json; json.dumps([u'Ä', u'Ä'.encode("utf-8")], ensure_ascii=False, encoding="UTF8")
u'["\xc4", "\xc4"]'
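
An alternative that does not depend on how the encoding argument behaves is to decode every byte string to unicode yourself before calling json.dumps. The following is only a sketch: to_text is a hypothetical helper (not part of the json module), and it assumes all byte strings in the data are UTF-8 encoded.

```python
import json

def to_text(obj, encoding="utf-8"):
    """Hypothetical helper: recursively decode byte strings to unicode."""
    if isinstance(obj, bytes):  # bytes is the plain str type in Python 2
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return dict((to_text(k, encoding), to_text(v, encoding))
                    for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return [to_text(item, encoding) for item in obj]
    return obj

# One unicode string and one UTF-8 encoded byte string, as in the question.
mixed = [u'\xc4', u'\xc4'.encode("utf-8")]

# After normalising, json.dumps only ever sees unicode strings.
result = json.dumps(to_text(mixed), ensure_ascii=False)
```

Because the input no longer mixes string types, the output is a single unicode string containing both items.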

Upvotes: 3
