maxpolk

Reputation: 2217

JSON stored as UTF-8 requires two encoding conversions

A JSON string can contain the escape sequence \u followed by four hex digits, which encode a 16-bit value (two octets).

After reading the four hex digits into c1, c2, c3, c4, the JSON Spirit C++ library returns a single character whose value is (hex_to_num (c1) << 12) + (hex_to_num (c2) << 8) + (hex_to_num (c3) << 4) + hex_to_num (c4).

Based on the simplicity of the decoding scheme, and on having only 2 octets to decode, I conclude that JSON escape sequences support only UCS-2 encoding, i.e. text from the BMP (U+0000 to U+FFFF) encoded "as is", using the code point as the 16-bit code unit.

Since UTF-16 and UCS-2 encode valid code points in U+0000 to U+FFFF as single 16-bit code units that are numerically equal to the corresponding code points (wikipedia), one can simply pretend that the decoded UCS-2 character is a UTF-16 character.

The escaped form differs from a normal unescaped JSON string, which can contain any Unicode character except the quotation mark, the backslash, or a control character (JSON spec). Since JSON is a subset of ECMAScript, which is assumed to be UTF-16 (ECMA standard), I conclude that JSON supports UTF-16 encoding, which is broader than what the escape sequence provides.

Now, having reduced all JSON strings to UTF-16, if one converts them from UTF-16 to UTF-8, my understanding is that it is possible to store the UTF-8 in a std::string on Linux, because during processing one can often ignore that a single code point consumes several std::string characters (a UTF-8 sequence is up to 4 bytes long for any code point reachable from UTF-16).

If all the above assumptions and interpretations are correct, one can safely parse JSON and store it into a std::string on Linux. Can someone please verify?

Upvotes: 2

Views: 2646

Answers (1)

Nick Bastin

Reputation: 31339

You are mistaken in several regards:

1) The \u escape values in JSON are UTF-16 code units, not UCS-2 code points. Despite the claims of Wikipedia, UCS-2 and UTF-16 are not (necessarily) the same and are not 100% byte compatible (although they are for all characters that existed before UTF-16 was created in the Unicode 2.0 standard).

2) The JSON escape sequence can represent all of UTF-16 by using surrogate pairs of code units.

Your end assertion is still true: you can safely parse JSON and store it in a std::string, but the conversion can't be based on the assumptions you're making (and using std::string essentially to store a bundle of bytes likely isn't what you want anyhow).

Upvotes: 2
