usual me

Reputation: 8778

json encoding - Why do utf8 and utf-8 produce different outputs?

These two commands output different results:

In [102]: json.dumps({'Café': 1}, ensure_ascii=False, encoding='utf-8')
Out[102]: '{"Caf\xc3\xa9": 1}'

In [103]: json.dumps({'Café': 1}, ensure_ascii=False, encoding='utf8')
Out[103]: u'{"Caf\xe9": 1}'

What's the difference between utf-8 and utf8?

Upvotes: 3

Views: 183

Answers (1)

Alastair McCormack

Reputation: 27724

Notice that the second call returns a unicode object, while the first returns a byte string.

It seems strange but the documentation calls this out:

If ensure_ascii is False, the result may contain non-ASCII characters and the return value may be a unicode instance.

It appears that a byte string is returned only when the encoding is spelled exactly "utf-8", ensure_ascii=False is set, and the input is a UTF-8 encoded byte string (not unicode). With a unicode input, the result is unicode again:

>>> json.dumps({u'Caf€': 1}, ensure_ascii=False, encoding='utf-8')
u'{"Caf\u20ac": 1}'

With ensure_ascii=False, all other valid encoding names, including the "utf8" alias, return a unicode instance.
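As a quick check, the two spellings are aliases for the same codec, so the difference presumably comes from how json.dumps treats the encoding name rather than from the codec itself. A minimal sketch (the variable names are illustrative, and it should run on both Python 2.7 and 3):

```python
import codecs

# Both spellings resolve to the same codec, whose canonical name is 'utf-8'.
canonical_names = (codecs.lookup('utf8').name, codecs.lookup('utf-8').name)

text = u'Caf\xe9'
# Encoding with either spelling produces identical bytes.
same_bytes = text.encode('utf8') == text.encode('utf-8')

print(canonical_names, same_bytes)
```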

If you set ensure_ascii=True, the output is consistent and works with other encodings, such as "windows-1252" (the input needs to be a unicode object).

I guess the rationale is that JSON should be ASCII and all non-ASCII characters should be escaped, even when the encoding is UTF-8.

To avoid any surprises, follow these rules:

For proper, spec-compliant ASCII JSON:

  1. Pass a unicode object
  2. Call:

    >>> json.dumps({u'Caf€': 1}, ensure_ascii=True)
    '{"Caf\\u20ac": 1}'
    

UTF-8 Encoded JSON:

  1. Pass a unicode object
  2. Call:

    >>> json.dumps({u'Caf€': 1}, ensure_ascii=False).encode("utf-8")
    '{"Caf\xe2\x82\xac": 1}'
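The two rules can be put side by side in one script. This is only a sketch: it leaves out the encoding argument entirely (Python 3's json.dumps does not accept one), and the names data, ascii_json and utf8_json are illustrative.

```python
import json

data = {u'Caf\u20ac': 1}

# Rule 1: ASCII-only JSON; non-ASCII characters become \uXXXX escapes.
ascii_json = json.dumps(data, ensure_ascii=True)

# Rule 2: UTF-8 encoded JSON; keep the characters, then encode explicitly.
utf8_json = json.dumps(data, ensure_ascii=False).encode('utf-8')

# Both forms decode back to the same object.
assert json.loads(ascii_json) == data
assert json.loads(utf8_json.decode('utf-8')) == data
```

Encoding explicitly at the end keeps the choice of output bytes in your own code instead of depending on how the encoding name is spelled.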
    

Upvotes: 1
