Paul Draper

Reputation: 83245

How to send Unicode JSON over HTTP?

RFC 4627 section 3 says

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8
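
For illustration, the null-pattern table quoted above can be sketched in Python roughly as follows. The function name `detect_json_encoding` and the fallback to UTF-8 for inputs shorter than four octets are assumptions made for this example, not part of the RFC. The heuristic only works when the first two characters are ASCII, as the quoted text states.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON octet stream from the
    pattern of null bytes in its first four octets (RFC 4627, section 3)."""
    head = data[:4]
    if len(head) < 4:
        # Assumption for this sketch: too short to classify, so use the default.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (xx xx xx xx) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


text = '["\u20ac"]'  # a JSON array, so the first two characters are ASCII
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, "->", detect_json_encoding(text.encode(codec)))
```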

I'm serving a UTF-8 encoded JSON string of U+20AC as application/json.

$ curl -D - http://localhost:8000/test.json
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.6
Date: Fri, 15 Jan 2016 09:24:53 GMT
Content-type: application/json
Content-Length: 6
Last-Modified: Fri, 15 Jan 2016 09:23:13 GMT

"€"

$ curl -s http://localhost:8000/test.json | hexdump
0000000 e222 ac82 0a22                         
0000006

But both Chrome and Firefox seemed to be using some other encoding, as both show

"â‚¬"
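
A quick way to see what is probably happening, assuming the browsers decode the body as Windows-1252 (which is how browsers commonly treat ISO-8859-1): hexdump prints 16-bit words in host byte order, so on a little-endian machine `e222 ac82 0a22` corresponds to the six bytes `22 e2 82 ac 22 0a`. The sketch below decodes those bytes both ways.

```python
# The six bytes of the response body, taken from the hexdump above.
body = bytes([0x22, 0xE2, 0x82, 0xAC, 0x22, 0x0A])

as_utf8 = body.decode("utf-8")     # what a JSON processor should see
as_cp1252 = body.decode("cp1252")  # what a Windows-1252 text viewer shows

print(repr(as_utf8))
print(repr(as_cp1252))
```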

If I change the Content-Type to application/json; charset=utf-8, they show the expected result.
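
For reference, one way to get that header out of Python's built-in file server is to override the extension map. This is only a sketch and assumes Python 3's `http.server` rather than the Python 2.7 `SimpleHTTPServer` shown in the response above. `JSONHandler` and `serve` are names made up for this example.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler


class JSONHandler(SimpleHTTPRequestHandler):
    # Copy the inherited map so the base class is left untouched,
    # then label .json files as UTF-8 explicitly.
    extensions_map = dict(SimpleHTTPRequestHandler.extensions_map)
    extensions_map[".json"] = "application/json; charset=utf-8"


def serve(port=8000):
    """Serve the current directory on localhost (blocks until interrupted)."""
    HTTPServer(("localhost", port), JSONHandler).serve_forever()

# To try it, call serve() and then run:
#   curl -D - http://localhost:8000/test.json
```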

But charset is a made-up addition to application/json, and I'm not sure whether it is legal to add extra parameters to it.

This is all rather confusing.

Is there a bug somewhere? What's the correct way for me to transmit UTF-8 encoded JSON documents over HTTP?

Upvotes: 2

Views: 4663

Answers (2)

bobince

Reputation: 536379

Your response is correct. The charset parameter shouldn't do anything on application/json, as it is a parameter of text/* types; a JSON processor will ignore it.

The problem is that Chrome and Firefox aren't acting as JSON processors here; they aren't parsing or validating anything in the response content. They're falling back to their regular old text viewers to display the content as if it were text/plain, on the premise that this is better than nothing.

Unfortunately plain text viewers have their own rules about guessing encodings that do not match JSON's in-content-signalling-only rules. IE's behaviour of treating application/json as an unknown binary type and prompting you to download it is actually the more-correct thing to do.
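
As an illustration of in-content signalling, here is a minimal sketch using Python's standard `json` module, which (from Python 3.6) accepts bytes and works out whether they are UTF-8, UTF-16 or UTF-32 from the content alone. No Content-Type header is involved at any point. The variable names are just for this example.

```python
import json

# The exact response body from the question: UTF-8 bytes, no charset label.
utf8_body = b'"\xe2\x82\xac"\n'
value = json.loads(utf8_body)
print(repr(value))

# The same character in a JSON array, sent as UTF-16LE instead.
utf16_body = '["\u20ac"]'.encode("utf-16-le")
items = json.loads(utf16_body)
print(repr(items))
```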

Upvotes: 4

Pekka

Reputation: 449415

The content-type header field you are using is perfectly valid.

The part of RFC 2616 that defines header fields in HTTP 1.1 treats adding the encoding after a semicolon as a valid way to go.

If you do not specify a character set, the browser will either use its defined default (ISO-8859-1, which is what happened in your case) or, depending on its settings, try to auto-detect the character set.
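
To make the fallback concrete, here is a rough sketch of how a client might pick a charset from the Content-Type header, using the standard library's `email.message.Message` to parse the parameter. The helper name `effective_charset` and its `default` argument are hypothetical, and the ISO-8859-1 default simply mirrors the behaviour described above rather than any particular browser's implementation.

```python
from email.message import Message


def effective_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value,
    or the given default when no charset is present."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


print(effective_charset("application/json; charset=utf-8"))
print(effective_charset("application/json"))
```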

As Julian points out (and you probably already knew), the application/json content type neither requires nor needs a charset parameter.

It appears that browsers are handling application/json responses incorrectly and falling back to ISO-8859-1 even though they shouldn't.

Upvotes: 1
