The official JSON standard (ECMA-404, 2e, 2017-Dec) states the following about Unicode surrogate pairs:
However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
What exactly is meant by this sentence?
Note that the same document states earlier:
Any code point may be represented as a hexadecimal escape sequence. The meaning of such a hexadecimal number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane [...]. [...] To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point.
This wording is poor and confusing. If the standard forces a UTF-16 interpretation (because a \uhhhh escape has exactly 4 hex digits, i.e. 2 bytes), in which Unicode code points above the Basic Multilingual Plane (BMP) are represented as two 2-byte code units instead of one, why doesn't it mention UTF-16 earlier (immediately after its mention of ISO/IEC 10646 and before discussing BMP characters) instead of only in the text about non-BMP characters? (Surely UTF-16 is implied for BMP code points as well. It is unlikely that the standard was meant to allow documents without non-BMP code points to follow a non-UTF-16 character encoding.)
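To make the "twelve-character sequence" concrete, here is a minimal sketch of how I read that passage: a non-BMP code point is split into a UTF-16 high and low surrogate, and each is written as one \uhhhh escape. The helper name escape_non_bmp is my own, not something from the standard; the final check only compares the result with what Python's json module happens to emit.

```python
import json


def escape_non_bmp(cp: int) -> str:
    """Return the twelve-character \\uhhhh\\uhhhh escape for a non-BMP code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point outside the BMP")
    offset = cp - 0x10000              # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)     # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # low surrogate carries the bottom 10 bits
    return f"\\u{high:04x}\\u{low:04x}"


escaped = escape_non_bmp(0x1D11E)      # U+1D11E MUSICAL SYMBOL G CLEF
print(escaped)                         # \ud834\udd1e
print(json.dumps(chr(0x1D11E)))        # "\ud834\udd1e" (same escape, in quotes)
```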
Back to the earlier sentence about interpreting surrogate pairs: if I now understand it correctly, it leaves it up to the JSON processor whether to (a) keep a sequence of 2-byte code units from the JSON text as an equally long sequence of code units or (b) convert it into a potentially shorter sequence of code points.
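A small sketch of the two readings, as I understand them. Python's json module takes route (b) and combines the pair into one code point. Route (a) is emulated here with a regular expression standing in for a hypothetical processor that keeps the 16-bit code units; it is not a real parser.

```python
import json
import re

text = '"\\ud834\\udd1e"'  # JSON string holding one escaped non-BMP character

# (b) Python's json module combines the surrogate pair into a single code point.
as_code_points = json.loads(text)
print(len(as_code_points), hex(ord(as_code_points)))  # 1 0x1d11e

# (a) A hypothetical processor could keep the two 16-bit code units instead.
as_code_units = [int(h, 16) for h in re.findall(r"\\u([0-9a-fA-F]{4})", text)]
print([hex(u) for u in as_code_units])  # ['0xd834', '0xdd1e']
```

The same JSON text therefore yields a string of length 1 under one reading and a sequence of length 2 under the other.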
But that's just like saying that a JSON processor can store a number as a double vs as a string after reading it in from a JSON value – not such a deep statement. Note also that the same standards document emphasizes that it doesn't impose a particular semantic interpretation on any JSON value (though the text "The meaning of such a hexadecimal number is determined by ISO/IEC 10646." seems to lay out a partial exception). It is also telling that RFC 8259 contains no wording analogous to the quoted sentence about surrogates.
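For comparison, the number analogy can be sketched with Python's json module, whose parse_float hook lets the caller decide how a JSON number is stored. The sample document is made up for illustration.

```python
import json

doc = '{"x": 0.1}'  # illustrative JSON text

as_double = json.loads(doc)["x"]                 # default: stored as a float
as_text = json.loads(doc, parse_float=str)["x"]  # kept as the original digits

print(type(as_double).__name__, type(as_text).__name__)  # float str
```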
Or is there more depth to the quoted sentence?
Upvotes: 1
Views: 102