bvdb

Reputation: 24710

UTF-16 split in 2 chars

I always believed that Java uses UTF-16 to encode its characters internally. This is confirmed by the fact that it uses the U+XXXX format to represent character codes and the fact that it uses 16 bits to store a char.

But sometimes UTF-16 needs more than 2 bytes. In that case Java needs 2 chars to represent 1 Unicode character.

On a side note: this makes me wonder if it's more correct to say that "Java just supports the Unicode character set, and uses 16-bit cells to store characters".

Question: Does the first char offer some method to determine that a second char is used, or that the two belong together?

Upvotes: 3

Views: 1822

Answers (1)

Jon Hanna

Reputation: 113242

Yes, UTF-16 was invented when Unicode expanded past the 65,536 code point limit of Unicode 1.0 to the 1,114,112 code point limit it has today.

This allows it to support the entire Universal Character Set while maintaining compatibility with UCS-2, the earlier encoding of every Unicode character as a single two-byte unit, which is obsolete precisely because it cannot encode all the characters in Unicode 2.0 and later.

Does the first char offer some method to determine that a second char is used, or that the 2 belong together?

Yes, in UTF-16, a two-byte unit is either:

  1. A high surrogate, which must always be followed by a low surrogate. It lies between 0xD800 and 0xDBFF inclusive, and isHighSurrogate will return true for it.
  2. A low surrogate, which must always follow a high surrogate. It lies between 0xDC00 and 0xDFFF inclusive, and isLowSurrogate will return true for it.
  3. A non-surrogate.

Non-surrogates map directly to the BMP character of the same code point.
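To make the classification concrete, here's a minimal sketch using Character.isHighSurrogate and Character.isLowSurrogate (the sample unit is just for illustration):

    // Classify one UTF-16 unit into the three cases above.
    char unit = 0xD800; // try 0xDF00 or 'A' as well
    if (Character.isHighSurrogate(unit)) {
        System.out.println("high surrogate (0xD800..0xDBFF)");
    } else if (Character.isLowSurrogate(unit)) {
        System.out.println("low surrogate (0xDC00..0xDFFF)");
    } else {
        System.out.println("non-surrogate: BMP character U+"
                + Integer.toHexString(unit).toUpperCase());
    }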

Surrogates combine to represent astral plane characters:

  1. Subtract 0x010000 from the code point.
  2. Add the top 10 bits to 0xD800 to get the high surrogate.
  3. Add the lower 10 bits to 0xDC00 to get the low surrogate.
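As a rough sketch of that arithmetic (hand-rolled purely to show the steps; in practice you'd use the Character methods described below):

    int codePoint = 0x10300;            // an astral plane code point
    int offset = codePoint - 0x010000;  // step 1
    char high = (char) (0xD800 + (offset >>> 10));   // step 2: top 10 bits
    char low  = (char) (0xDC00 + (offset & 0x3FF));  // step 3: lower 10 bits
    // For U+10300 this yields 0xD800 and 0xDF00.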

In Java you can do this by first checking isBmpCodePoint on an int holding the code point. If that is true then you can just cast it to char to get the single UTF-16 unit that encodes it. Otherwise you can call highSurrogate to get the first char and lowSurrogate to get the second.
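A minimal sketch of that decision, assuming you start from an int holding the code point:

    int cp = 0x10300;
    if (Character.isBmpCodePoint(cp)) {
        char single = (char) cp;               // one UTF-16 unit is enough
        System.out.printf("%04X%n", (int) single);
    } else {
        char hi = Character.highSurrogate(cp); // first unit
        char lo = Character.lowSurrogate(cp);  // second unit
        System.out.printf("%04X %04X%n", (int) hi, (int) lo); // D800 DF00
    }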

As well as isBmpCodePoint you could use charCount which returns 1 for BMP characters and 2 if you need surrogates. This is useful if you are going to create an array of either 1 or 2 characters to hold the value.
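For example, a small helper along those lines (hypothetical name; the standard library's Character.toChars does essentially the same job):

    static char[] toUtf16Units(int codePoint) {
        char[] units = new char[Character.charCount(codePoint)]; // length 1 or 2
        if (units.length == 1) {
            units[0] = (char) codePoint;
        } else {
            units[0] = Character.highSurrogate(codePoint);
            units[1] = Character.lowSurrogate(codePoint);
        }
        return units;
    }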

Since the surrogate code points are never assigned characters of their own, the encoding is unambiguous for the entire Universal Character Set.

It's also self-correcting: a mistake in the stream can be isolated rather than causing all subsequent characters to be misread. For example, if we find an isolated low surrogate we know that unit is wrong, but we can still read the rest of the stream.
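A sketch of that recovery, scanning a possibly malformed String and reporting unpaired surrogates without losing the rest of the stream (the method name is just for illustration):

    static void scan(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // A valid pair: combine it into a single code point.
                System.out.printf("U+%X%n", Character.toCodePoint(c, s.charAt(i + 1)));
                i++; // the low surrogate was consumed as part of the pair
            } else if (Character.isSurrogate(c)) {
                System.out.println("unpaired surrogate at index " + i); // isolated error
            } else {
                System.out.printf("U+%04X%n", (int) c); // ordinary BMP character
            }
        }
    }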

Some full examples follow, but I'm not too hot at Java (Unicode, on the other hand, I know well, and that's the knowledge I used to answer this), so if someone spots a n00b Java error but thinks I got the Unicode part correct, please just go ahead and edit this post accordingly:

"𐌞" is a string with a single Unicode character, U+10300 which is a letter from the Old Italic Alphabet. For the most part, these "Astral Planes" characters as they're semi-jokingly called are relatively obscure as the Unicode Consortium try to be as useful as they can without going outside the easier-to-use BMP (Basic Multilingual Plane; U+0000 to U+FFFF, though sometimes listed as "U+0000 to U+FFFD as U+FFFE and U+FFFF are both non-characters and shouldn't be used in most cases).

(If you're experimenting with this, the examples that use 𐌞 directly will depend on how well your text editor copes with it.)

If you examine "𐌞".length() you'll get 2, because length() gives you the number of UTF-16 encoding units, not the number of characters.

new StringBuilder().appendCodePoint(0x10300).toString().equals("𐌞") should return true (equals rather than == is needed here, since == compares object references, not string contents).

Character.charCount(0x10300) will return 2, as we need two UTF-16 chars to encode it. Character.isBmpCodePoint(0x10300) will return false.

Character.codePointAt("𐌞", 0) will return 66304, which is 0x10300, because when it sees a high surrogate it also reads the following low surrogate and combines the two into a single code point.

Character.highSurrogate(0x10300) == 0xD800 && Character.lowSurrogate(0x10300) == 0xDF00 is true, as those are the high and low surrogates the character splits into when encoded in UTF-16.

Likewise "𐌞".charAt(0) == 0xD800 && "𐌞".charAt(1) == 0xDF00 is true, because charAt deals in UTF-16 units, not Unicode characters.

By the same token, "𐌞".equals("\uD800\uDF00") is true, where the second string spells out the two surrogates as escapes.
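Pulling those checks together into one runnable sketch (it uses the \uD800\uDF00 escapes so it doesn't depend on your editor handling 𐌞):

    public class AstralDemo {
        public static void main(String[] args) {
            String s = "\uD800\uDF00";                               // "𐌞", U+10300
            System.out.println(s.length());                          // 2 UTF-16 units
            System.out.println(Character.charCount(0x10300));        // 2
            System.out.println(Character.isBmpCodePoint(0x10300));   // false
            System.out.println(Character.codePointAt(s, 0));         // 66304 == 0x10300
            System.out.println((int) Character.highSurrogate(0x10300) == 0xD800); // true
            System.out.println((int) Character.lowSurrogate(0x10300) == 0xDF00);  // true
            System.out.println(new StringBuilder().appendCodePoint(0x10300)
                    .toString().equals(s));                          // true
        }
    }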

Upvotes: 7
