Reputation: 8288
I'm writing a JSON parser in Xojo. It's working apart from the fact that I can't figure out how to encode and decode Unicode strings that are not in the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters something greater than \uFFFF.
The specs say:
To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.
What I don't understand is the algorithm for going from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".
For example, say I want to encode the smiley face character (U+1F600). What would this be as a UTF-16 surrogate pair, and what is the working to derive it?
Could somebody please at least point me in the correct direction?
Upvotes: 5
Views: 708
Reputation: 8288
Taken from the Wikipedia article linked by Remy Lebeau in the comments above (link):
To encode U+10437 (𐐷) to UTF-16:
Subtract 0x10000 from the code point, leaving 0x0437.
For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.
For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.
To decode U+10437 (𐐷) from UTF-16:
Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.
Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.
Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
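Here is a minimal Xojo sketch of that arithmetic (not from the original post; the method names and the escaped-string output format are illustrative assumptions):

' Turn a code point above U+FFFF into a JSON "\uXXXX\uXXXX" surrogate-pair escape.
Function EncodeSurrogatePair(codePoint As Integer) As String
  Dim v As Integer = codePoint - &h10000           ' remove the 0x10000 offset
  Dim high As Integer = (v \ &h400) + &hD800       ' top 10 bits -> high surrogate
  Dim low As Integer = (v Mod &h400) + &hDC00      ' low 10 bits -> low surrogate
  Return "\u" + Hex(high) + "\u" + Hex(low)
End Function

' Recombine a high/low surrogate pair back into a single code point.
Function DecodeSurrogatePair(high As Integer, low As Integer) As Integer
  Return ((high - &hD800) * &h400) + (low - &hDC00) + &h10000
End Function

Applying the same steps to the smiley face from the question: &h1F600 - &h10000 = &hF600; &hF600 \ &h400 = &h3D, plus &hD800 gives &hD83D; &hF600 Mod &h400 = &h200, plus &hDC00 gives &hDE00. So U+1F600 is escaped as "\uD83D\uDE00".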
Upvotes: 5