dan gibson

Reputation: 3665

In the JSON spec, what does "Since the first two characters of a JSON text will always be ASCII characters" mean?

RFC 4627 on JSON reads:

  1. Encoding

    JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

    Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

What does "Since the first two characters of a JSON text will always be ASCII characters [RFC0020]" mean? I've looked at RFC0020 but couldn't find anything about it. JSON could start with {" or with { " (i.e. whitespace before the quote).

Upvotes: 8

Views: 3539

Answers (2)

Tom Blodget

Reputation: 20802

RFC 4627 requires a JSON document to represent either an object or an array. So, the first characters (after any amount of JSON whitespace) must be [ followed by a value, or { followed by ". Values are null, true, false, a string ("…), an object, or an array. Since the JSON whitespace characters, [, {, n, t, f, and " are all in the C0 Controls and Basic Latin block, they are also in the ASCII character set [by the design of Unicode]. (Not sure why the standard is fixated on "ASCII" when it says, "JSON text SHALL be encoded in Unicode." Future standards drop the reference.)
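
For illustration, here is a minimal Python sketch of how the two-character prefix {" is encoded in each allowed encoding; only UTF-8 leaves no null bytes among the first four octets:

    # How the prefix '{"' (U+007B U+0022) looks in each allowed encoding.
    prefix = '{"'
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(enc, prefix.encode(enc).hex(" "))
    # utf-8      7b 22
    # utf-16-be  00 7b 00 22
    # utf-16-le  7b 00 22 00
    # utf-32-be  00 00 00 7b 00 00 00 22
    # utf-32-le  7b 00 00 00 22 00 00 00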

UTF-32 has four bytes per character. UTF-16 has two. So, to distinguish between UTF-16 and UTF-32, you need 4 bytes. In both of those encodings, characters from the C0 Controls and Basic Latin block are encoded with at most one non-zero byte (a byte with a value of 0 is sometimes called a "null byte"). Also, U+0000 (which is encoded as 0x00 0x00 0x00 0x00 in UTF-32 and 0x00 0x00 in UTF-16) is not valid JSON whitespace. So, the pattern of 0x00 bytes can be used to determine which of the allowed encodings a valid JSON document uses.
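
A reader could apply RFC 4627's null-byte patterns roughly like this (a minimal Python sketch; the function name is mine):

    def detect_json_encoding(data: bytes) -> str:
        # Guess the encoding of a pre-RFC 7159 JSON text from the pattern
        # of null bytes in its first four octets (RFC 4627).
        if len(data) < 4:
            raise ValueError("need at least four octets")
        z = [b == 0 for b in data[:4]]
        if z == [True, True, True, False]:
            return "utf-32-be"    # 00 00 00 xx
        if z == [True, False, True, False]:
            return "utf-16-be"    # 00 xx 00 xx
        if z == [False, True, True, True]:
            return "utf-32-le"    # xx 00 00 00
        if z == [False, True, False, True]:
            return "utf-16-le"    # xx 00 xx 00
        return "utf-8"            # xx xx xx xx (no nulls)

    detect_json_encoding('{"a":1}'.encode("utf-16-le"))  # -> "utf-16-le"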

RFC 7159 changed JSON to allow a JSON document to represent any value, not just an object or array. That invalidated the statement quoted above, so the now-broken encoding detection scheme was removed from the standard.

For accurate detection, you need to see the beginning and the end of the document. 0x22 0x00 0x00 0x00 at the beginning could be any of UTF-8, UTF-16LE, or UTF-32LE; it's the start of a string with zero or more U+0000 characters. In this case, you need the number of 0x00 bytes at the end to tell which.

RFC 8259 changed JSON to require UTF-8 (for JSON "exchanged between systems that are not part of a closed ecosystem"). Out of practicality, a JSON reader would still accept UTF-16 and UTF-32.


In the end, some popular JSON parsers leave character decoding up to the caller, having APIs that accept only the "native" string type for the programming environment. (This opens up the very common hazard of using the wrong character encoding for reading text files or HTTP bodies.)
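
With such an API, the caller has to decode the bytes and pick the encoding itself, along these lines (a trivial Python sketch; the file name is hypothetical):

    import json

    with open("data.json", "rb") as f:   # hypothetical file
        raw = f.read()

    # The caller chooses the encoding; a wrong guess mangles non-ASCII text.
    obj = json.loads(raw.decode("utf-8"))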

Upvotes: 6

Oded

Reputation: 499062

It means that since JSON will always start with ASCII characters (non-ASCII is only permitted in strings, which cannot be the root object), it is possible to determine from the start of the stream/file what encoding it is in.

UTF-16 and UTF-32 streams should have a BOM at the start, and by finding out what it is, you can determine the exact encoding. This is possible because you can then check whether the decoded first characters are valid JSON or not.
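
For example, BOM sniffing can be sketched like this (a minimal Python illustration; the longer UTF-32 BOMs must be checked before the UTF-16 ones, because the UTF-32LE BOM begins with the UTF-16LE BOM):

    import codecs

    # Longest BOMs first: BOM_UTF32_LE (FF FE 00 00) starts with
    # BOM_UTF16_LE (FF FE), so order matters.
    BOMS = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF8, "utf-8-sig"),
    ]

    def sniff_bom(data: bytes):
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
        return None   # no BOM: fall back to the null-byte pattern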

I assume the spec specifically mentions this because for many other text streams/files this is not possible (most text files can start with any two characters, so the first bytes of the actual file are not known in advance).

Upvotes: 8
