Reputation: 600
I am trying to construct a char from code points in either the high-surrogate range (\uD800-\uDBFF) or the low-surrogate range (\uDC00-\uDFFF), following https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html. The code sample is:
for (int cp = Character.MIN_SURROGATE; cp <= Character.MAX_SURROGATE; cp++) {
    char c = (char) cp;
    if (Character.isHighSurrogate(c)) {
        char low = Character.lowSurrogate(cp);
        System.out.println(Character.isSurrogatePair(c, low));
    } else if (Character.isLowSurrogate(c)) {
        char high = Character.highSurrogate(cp);
        System.out.println(Character.isSurrogatePair(high, c));
    }
}
What confuses me is that every high-surrogate char can find a corresponding low surrogate by calling Character.lowSurrogate, and Character.isSurrogatePair(c, low) is always true, but a low-surrogate char cannot find a corresponding high surrogate: Character.highSurrogate always returns the same char, with code point 55287.
Any idea how to construct a valid character given a low surrogate char?
Upvotes: 0
Views: 846
Reputation: 596256
You have a fundamental misunderstanding of what surrogates are and how they work together. I suggest you read up on how UTF-16 actually works.
Character.is(High|Low)Surrogate() take a UTF-16 codeunit as input. If the result is true, calling Character.(high|low)Surrogate() on the same numeric value doesn't do what you think it does. You can't find a low surrogate from a high surrogate, or vice versa. Any high surrogate can be combined with many different low surrogates (e.g. D800 DC00, D800 DC01, D800 DC02, ...), and any low surrogate can be combined with many different high surrogates (e.g. D800 DC00, D801 DC00, D802 DC00, ...), to produce different Unicode codepoints.
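To make the many-to-many relationship concrete, here is a small sketch built on Character.toCodePoint(), which combines a high and a low surrogate into a codepoint. The class name SurrogateCombos and the helper combine() are made up for this illustration.

```java
class SurrogateCombos {
    // Hypothetical helper: combine a surrogate pair into a code point,
    // rejecting anything that is not a valid high+low pair.
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // One high surrogate, two different low surrogates.
        System.out.printf("D800 DC00 -> U+%X%n", combine('\uD800', '\uDC00')); // U+10000
        System.out.printf("D800 DC01 -> U+%X%n", combine('\uD800', '\uDC01')); // U+10001
        // One low surrogate, two more high surrogates.
        System.out.printf("D801 DC00 -> U+%X%n", combine('\uD801', '\uDC00')); // U+10400
        System.out.printf("D802 DC00 -> U+%X%n", combine('\uD802', '\uDC00')); // U+10800
    }
}
```

Neither half of the pair determines the other; only the two together identify a codepoint.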
Character.(high|low)Surrogate() take a Unicode codepoint as input and return the high/low surrogates needed to encode that codepoint in UTF-16, respectively.
You are enumerating codeunits and treating them as if they were codepoints, which they are not (codepoints in the range of UTF-16 surrogates are reserved and thus invalid for use in any context). In other words, you are mixing apples and tangerines, so of course the results are not what you are expecting.
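For comparison, here is a sketch of what the loop could look like if it enumerated supplementary codepoints instead of surrogate codeunits. The class name SupplementaryRoundTrip and the method countFailures() are hypothetical; everything else is the standard Character API.

```java
class SupplementaryRoundTrip {
    // Counts supplementary code points whose surrogates do NOT round-trip.
    static int countFailures() {
        int failures = 0;
        for (int cp = Character.MIN_SUPPLEMENTARY_CODE_POINT;
                 cp <= Character.MAX_CODE_POINT; cp++) {
            // Valid input: cp really is a supplementary code point here.
            char high = Character.highSurrogate(cp);
            char low = Character.lowSurrogate(cp);
            boolean ok = Character.isSurrogatePair(high, low)
                    && Character.toCodePoint(high, low) == cp;
            if (!ok) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + countFailures()); // failures: 0
    }
}
```

With valid input, the two surrogates are both derived from the codepoint, and combining them gives the codepoint back.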
Let's take a deeper look at your code example in context. Consider the two extremes that your loop processes, Character.MIN_SURROGATE ('\uD800') and Character.MAX_SURROGATE ('\uDFFF'). The same holds for all other values in between:
Character.isHighSurrogate((char)0xD800) returns true and Character.isLowSurrogate((char)0xD800) returns false, because codeunit D800 is in the range of high surrogate codeunits (D800..DBFF).
Character.isHighSurrogate((char)0xDFFF) returns false and Character.isLowSurrogate((char)0xDFFF) returns true, because codeunit DFFF is in the range of low surrogate codeunits (DC00..DFFF).
Character.lowSurrogate((int)0xD800) and Character.highSurrogate((int)0xDFFF) both return an unspecified char value, because 0xD800 and 0xDFFF are not valid "supplementary characters" (Unicode codepoints) in the range of Character.MIN_SUPPLEMENTARY_CODE_POINT..Character.MAX_CODE_POINT (U+10000..U+10FFFF), i.e. Character.isSupplementaryCodePoint() returns false. Per Java's documentation:
Returns the trailing surrogate (a low surrogate code unit) of the surrogate pair representing the specified supplementary character (Unicode code point) in the UTF-16 encoding. If the specified character is not a supplementary character, an unspecified char is returned.
Returns the leading surrogate (a high surrogate code unit) of the surrogate pair representing the specified supplementary character (Unicode code point) in the UTF-16 encoding. If the specified character is not a supplementary character, an unspecified char is returned.
Determines whether the specified character (Unicode code point) is in the supplementary character range.
...
Returns: true if the specified code point is between MIN_SUPPLEMENTARY_CODE_POINT and MAX_CODE_POINT inclusive; false otherwise.
So, generally speaking, if Character.isHighSurrogate() returns true, the result of Character.lowSurrogate() for the same numeric value is unspecified. And if Character.isLowSurrogate() returns true, the result of Character.highSurrogate() for the same numeric value is unspecified.
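Thus, what you observed for high surrogates (Character.lowSurrogate appearing to find a matching low surrogate, and Character.isSurrogatePair(c, low) always being true) is not something you can rely on. You passed a non-supplementary value, so the returned char is unspecified. It may happen to land in the low surrogate range on your JDK, but the pair it forms is not an encoding of the value you passed in.
The same applies to low surrogates. You are right that a low-surrogate char cannot find a corresponding high surrogate this way, but the fact that Character.highSurrogate kept returning 55287 (0xD7F7, which is not even a surrogate) is likewise unspecified behavior, not a meaningful result.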
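Since the result for a non-supplementary input is unspecified, one way to avoid depending on it is to check Character.isSupplementaryCodePoint() before asking for surrogates. The following is only a sketch; SurrogateGuard and toSurrogatePair() are hypothetical names, not part of the JDK.

```java
class SurrogateGuard {
    // Hypothetical helper: returns {high, low} for a supplementary code point
    // and refuses anything else instead of returning an unspecified char.
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                    String.format("U+%04X is not a supplementary code point", codePoint));
        }
        return new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00
        try {
            toSurrogatePair(0xDFFF); // a surrogate value, not a supplementary code point
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```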
Lastly, regarding your actual question:
Any idea how to construct a valid character given a low surrogate char?
You can't construct a Unicode codepoint from just a high or low surrogate alone. It is the combination of high and low surrogates acting together that defines a specific codepoint, so you need both. Each surrogate carries only 10 of the 20 bits involved, so with a single surrogate you don't have enough bits to reconstruct the codepoint. Period.
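The bit arithmetic behind this can be sketched as follows. SurrogateBits, decode() and candidatesFor() are hypothetical names used only for illustration; decode() assumes it is given a valid high+low pair and does no validation.

```java
import java.util.HashSet;
import java.util.Set;

class SurrogateBits {
    // Manual decoding: each surrogate contributes 10 bits of (codePoint - 0x10000).
    // Assumes 'high' and 'low' form a valid surrogate pair.
    static int decode(char high, char low) {
        int top10 = high - Character.MIN_HIGH_SURROGATE;    // 0..0x3FF
        int bottom10 = low - Character.MIN_LOW_SURROGATE;   // 0..0x3FF
        return Character.MIN_SUPPLEMENTARY_CODE_POINT + (top10 << 10) + bottom10;
    }

    // How many distinct code points a lone low surrogate could belong to.
    static int candidatesFor(char low) {
        Set<Integer> codePoints = new HashSet<>();
        for (int h = Character.MIN_HIGH_SURROGATE; h <= Character.MAX_HIGH_SURROGATE; h++) {
            codePoints.add(Character.toCodePoint((char) h, low));
        }
        return codePoints.size();
    }

    public static void main(String[] args) {
        System.out.printf("U+%X%n", decode('\uD83D', '\uDE00')); // U+1F600
        System.out.println(candidatesFor('\uDC00'));             // 1024
    }
}
```

A lone low surrogate is compatible with 1024 different codepoints, one per possible high surrogate, which is why it cannot be turned into a character by itself.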
Upvotes: 1