Reputation: 83245
RFC 4627 section 3 says
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
00 00 00 xx UTF-32BE
00 xx 00 xx UTF-16BE
xx 00 00 00 UTF-32LE
xx 00 xx 00 UTF-16LE
xx xx xx xx UTF-8
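The table above can be turned into a small detection routine. A minimal sketch (the function name is mine, not from the RFC; note the RFC's caveat that this only works because the first two characters of a JSON text are assumed to be ASCII):

```python
def detect_json_encoding(octets):
    """Guess the Unicode encoding of a JSON text from its first four
    octets, using the null-byte patterns from RFC 4627 section 3."""
    b = octets[:4]
    if len(b) < 4:
        return "utf-8"  # too short to apply the table; assume the default
    nulls = tuple(x == 0 for x in b)
    if nulls == (True, True, True, False):
        return "utf-32-be"   # 00 00 00 xx
    if nulls == (True, False, True, False):
        return "utf-16-be"   # 00 xx 00 xx
    if nulls == (False, True, True, True):
        return "utf-32-le"   # xx 00 00 00
    if nulls == (False, True, False, True):
        return "utf-16-le"   # xx 00 xx 00
    return "utf-8"           # xx xx xx xx
```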
I'm serving a UTF-8 encoded JSON string of U+20AC as application/json.
$ curl -D - http://localhost:8000/test.json
HTTP/1.0 200 OK
Server: SimpleHTTP/0.6 Python/2.7.6
Date: Fri, 15 Jan 2016 09:24:53 GMT
Content-type: application/json
Content-Length: 6
Last-Modified: Fri, 15 Jan 2016 09:23:13 GMT
"€"
$ curl -s http://localhost:8000/test.json | hexdump
0000000 e222 ac82 0a22
0000006
But both Chrome and Firefox seemed to be using some other encoding, as both show
"â‚¬"
If I change the Content-Type to application/json; charset=utf-8, they show the expected result.
But charset is a made-up addition to application/json, and I'm not sure whether it is legal to add extra parameters to it.
This is all rather confusing.
Is there a bug somewhere? What's the correct way for me to transmit UTF-8 encoded JSON documents over HTTP?
Upvotes: 2
Views: 4663
Reputation: 536379
Your response is correct. charset shouldn't do anything on application/json, as that's a parameter of text/ types; a JSON processor will ignore it.
The problem is that Chrome and Firefox aren't acting as JSON processors here; they aren't parsing or validating anything in the response content. They're falling back to their regular old text viewers to display the content as if it were text/plain, on the premise that this is better than nothing.
Unfortunately, plain-text viewers have their own rules for guessing encodings, and those rules do not match JSON's in-content-signalling-only rules. IE's behaviour of treating application/json as an unknown binary type and prompting you to download it is actually the more correct thing to do.
Upvotes: 4
Reputation: 449415
The Content-Type header field you are using is perfectly valid.
RFC 2616, which defines header fields for HTTP/1.1, treats a parameter added after a semicolon (such as charset) as a valid part of the media type.
If you do not specify a character set, the browser will either use its defined default (ISO-8859-1, which is what happened in your case) or, depending on its settings, try to auto-detect the character set.
As Julian points out (and you probably already knew), the application/json content type doesn't require, nor need, an addition of charset.
It appears that browsers are handling application/json responses incorrectly and falling back to ISO-8859-1 even though they shouldn't.
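The fallback can be reproduced directly: decode the UTF-8 octets of "€" as windows-1252 (which is what browsers actually use when the ISO-8859-1 default applies). A small demonstration in Python:

```python
# The octets on the wire (minus the trailing newline): a UTF-8 encoded
# JSON string containing U+20AC.
payload = '"\u20ac"'.encode("utf-8")
assert payload == b'\x22\xe2\x82\xac\x22'

# A UTF-8-aware JSON processor decodes it correctly:
print(payload.decode("utf-8"))    # "€"

# A plain-text viewer falling back to the legacy default (browsers treat
# ISO-8859-1 as windows-1252) produces the mojibake seen in the browser:
print(payload.decode("cp1252"))   # "â‚¬"
```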
Here is an open bug report for Chromium
Here is a discussion about making the same server-side change you did
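That server-side change can be sketched with Python 3's http.server (the question used Python 2's SimpleHTTPServer; the handler class name here is mine):

```python
import http.server

class Utf8JsonHandler(http.server.SimpleHTTPRequestHandler):
    """SimpleHTTPRequestHandler that adds an explicit charset for .json files."""

    def guess_type(self, path):
        # Tag .json files so browsers that fall back to their plain-text
        # viewer pick UTF-8 instead of their legacy default encoding.
        if path.endswith(".json"):
            return "application/json; charset=utf-8"
        return super().guess_type(path)

# Serve the current directory much like `python -m http.server` does:
#   http.server.test(HandlerClass=Utf8JsonHandler, port=8000)
```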
Upvotes: 1