Reputation: 24710
I always believed that Java uses UTF-16 to encode its characters internally. This is confirmed by the fact that it uses the u+xxxx format to represent character codes and the fact that it uses 16 bits to store a char.
But sometimes UTF-16 needs more than 2 bytes. In that case Java needs two chars to represent one Unicode character.
On a side note: this makes me wonder if it's more correct to say that "Java just supports the Unicode character set, and uses 16-bit cells to store characters".
Question: Does the first char offer some method to determine that a second char is used, or that the two belong together?
Upvotes: 3
Views: 1822
Reputation: 113242
Yes. UTF-16 was invented when Unicode expanded past the 65,536-code-point limit of Unicode 1.0 to the 1,114,112 code points it has today.
This allows it to support the entire Universal Character Set while maintaining compatibility with UCS-2, the encoding of every Unicode character as a single two-byte unit that is now obsolete precisely because it cannot encode all the characters of Unicode 2.0 and later.
Does the first char offer some method to determine that a second char is used, or that the 2 belong together?
Yes, in UTF-16 a two-byte unit is either:

- a high surrogate, in the range 0xD800 to 0xDBFF inclusive, for which isHighSurrogate returns true;
- a low surrogate, in the range 0xDC00 to 0xDFFF inclusive, for which isLowSurrogate returns true;
- a non-surrogate, which maps directly to the BMP character of the same code point.

A high surrogate and a low surrogate combine to represent an astral plane character.
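For reference, the combination is plain arithmetic. A minimal sketch to drop into a main method (the variable names are mine; Character.toCodePoint performs the same calculation for you):

    char high = 0xD800;                            // high surrogate of U+10300
    char low  = 0xDF00;                            // low surrogate of U+10300

    // Each surrogate contributes 10 bits; the result is offset by 0x10000,
    // the start of the supplementary ("astral") planes.
    int manual = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

    // The standard library does the same calculation.
    int viaLibrary = Character.toCodePoint(high, low);

    System.out.printf("U+%04X U+%04X%n", manual, viaLibrary);   // U+10300 U+10300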
In Java you can do this by first checking isBmpCodePoint on an int holding the code point. If that returns true then you can just cast it to char to get the single UTF-16 unit that encodes it. Otherwise you can call highSurrogate to get the first char and lowSurrogate to get the second.
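A rough sketch of that branch, again for a main method (the variable names are mine, and 0x10300 is just a sample code point):

    int codePoint = 0x10300;                       // sample code point (Old Italic letter)
    char[] encoded;
    if (Character.isBmpCodePoint(codePoint)) {
        // One UTF-16 unit is enough; the cast is lossless for BMP code points.
        encoded = new char[] { (char) codePoint };
    } else {
        // Two units: the high surrogate first, then the low surrogate.
        encoded = new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }
    System.out.println(new String(encoded));       // prints 𐌞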
As well as isBmpCodePoint you could use charCount, which returns 1 for BMP characters and 2 if you need surrogates. This is useful if you are going to create an array of either one or two chars to hold the value.
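For example (Character.toChars isn't mentioned above, but it bundles exactly this allocate-and-fill logic):

    int codePoint = 0x10300;
    char[] units = new char[Character.charCount(codePoint)];   // length 1 or 2

    if (units.length == 1) {
        units[0] = (char) codePoint;
    } else {
        units[0] = Character.highSurrogate(codePoint);
        units[1] = Character.lowSurrogate(codePoint);
    }

    // Character.toChars does the same allocate-and-fill in one call.
    char[] same = Character.toChars(codePoint);
    System.out.println(java.util.Arrays.equals(units, same));  // true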
Since the surrogate code points are never assigned characters, the encoding is unambiguous for the entire Universal Character Set.
It's also self-correcting: a mistake in the stream can be isolated rather than leading to all further characters being misread. For example, if we find an isolated low surrogate we know that unit is wrong but can still read the rest of the stream.
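A small illustration of that isolation, using a deliberately damaged sequence I made up for the example (drop it in a main method):

    // A valid surrogate pair followed by an orphaned low surrogate.
    char[] units = { 0xD800, 0xDF00, 0xDC00 };

    for (int i = 0; i < units.length; i++) {
        boolean orphanLow = Character.isLowSurrogate(units[i])
                && (i == 0 || !Character.isHighSurrogate(units[i - 1]));
        boolean orphanHigh = Character.isHighSurrogate(units[i])
                && (i + 1 >= units.length || !Character.isLowSurrogate(units[i + 1]));
        System.out.println(i + ": " + (orphanLow || orphanHigh ? "broken here" : "fine"));
    }
    // Only index 2 is reported as broken; the pair before it still decodes normally.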
Some full examples follow, but I'm not too hot in Java (Unicode, on the other hand, I know well, and that's the knowledge I used to answer this), so if someone spots a n00b Java error but thinks I got the Unicode part correct, please go ahead and edit this post accordingly:

"𐌞" is a string with a single Unicode character, U+10300, which is a letter from the Old Italic alphabet. For the most part these "astral plane" characters, as they're semi-jokingly called, are relatively obscure, as the Unicode Consortium tries to be as useful as it can without going outside the easier-to-use BMP (Basic Multilingual Plane; U+0000 to U+FFFF, though sometimes listed as U+0000 to U+FFFD, since U+FFFE and U+FFFF are both non-characters and shouldn't be used in most cases).
(If you're experimenting with this, then the examples that use 𐌞 directly will depend on how well your text editor copes with it.)
If you examine "𐌞".length() you'll get 2, because length() gives you the number of UTF-16 encoding units, not the number of characters.
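If you want the number of actual characters instead, codePointCount (not used in this answer, but part of the same API) counts code points rather than units:

    String s = "𐌞";
    System.out.println(s.length());                        // 2 UTF-16 units
    System.out.println(s.codePointCount(0, s.length()));   // 1 Unicode character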
new StringBuilder().appendCodePoint(0x10300).toString().equals("𐌞") should return true. (Note the .equals rather than ==: the built string is a new object, so a reference comparison would be false even when the contents match.)
Character.charCount(0x10300) will return 2, as we need two UTF-16 chars to encode it. Character.isBmpCodePoint(0x10300) will return false.
Character.codePointAt("𐌞", 0) will return 66304, which is 0x10300, because when it sees a high surrogate it includes the following low surrogate in the calculation.
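That pairing is also what lets you walk a string one code point at a time; a sketch for a main method (the codePoints() stream at the end is my addition, available since Java 8):

    String s = "a𐌞b";                          // BMP char, astral char, BMP char
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);             // reads one unit, or two for a surrogate pair
        System.out.printf("U+%04X%n", cp);
        i += Character.charCount(cp);          // advance by however many units were consumed
    }
    // Or, since Java 8, the same traversal as a stream:
    s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));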
Character.highSurrogate(0x10300) == 0xD800 && Character.lowSurrogate(0x10300) == 0xDF00 is true, as those are the high and low surrogates the character should be split into to encode it in UTF-16.
Likewise "𐌞".charAt(0) == 0xD800 && "𐌞".charAt(1) == 0xDF00, because charAt deals with UTF-16 units, not Unicode characters.
By the same token "𐌞".equals("\uD800\uDF00") is true; the second literal spells out the two surrogates as escapes. (Both are constant string literals, so even == happens to hold here thanks to interning, but .equals is the meaningful comparison.)
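Putting the walkthrough together into one runnable class (my own wrapper around the same calls; the escape form of the literal stands in for the raw 𐌞 in case your editor mangles it):

    public class AstralDemo {
        public static void main(String[] args) {
            String s = "\uD800\uDF00";   // the same string as "𐌞", written with surrogate escapes

            System.out.println(s.length());                                        // 2
            System.out.println(Character.charCount(0x10300));                      // 2
            System.out.println(Character.isBmpCodePoint(0x10300));                 // false
            System.out.println(Character.codePointAt(s, 0));                       // 66304 (0x10300)
            System.out.println(Character.highSurrogate(0x10300) == 0xD800);        // true
            System.out.println(Character.lowSurrogate(0x10300) == 0xDF00);         // true
            System.out.println(s.charAt(0) == 0xD800 && s.charAt(1) == 0xDF00);    // true
            System.out.println(new StringBuilder().appendCodePoint(0x10300)
                                                  .toString().equals(s));          // true
        }
    }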
Upvotes: 7