dashenswen

Reputation: 600

How to construct a string using surrogate pairs

I am trying to construct a char from code points in either the high-surrogate range (\uD800-\uDBFF) or the low-surrogate range (\uDC00-\uDFFF), as described in https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html. The code sample is:

    for (int cp = Character.MIN_SURROGATE; cp <= Character.MAX_SURROGATE; cp++) {
        char c = (char) cp;
        if (Character.isHighSurrogate(c)) {
            char low = Character.lowSurrogate(cp);
            System.out.println(Character.isSurrogatePair(c, low));
        } else if (Character.isLowSurrogate(c)) {
            char high = Character.highSurrogate(cp);
            System.out.println(Character.isSurrogatePair(high, c));
        }
    }

What confuses me is that every high-surrogate char can find a corresponding low surrogate by calling Character.lowSurrogate, and Character.isSurrogatePair(c, low) is always true, but a low-surrogate char cannot find a corresponding high surrogate: Character.highSurrogate always returns the same char, with code point 55287.

Any idea how to construct a valid character given a low surrogate char?

Upvotes: 0

Views: 846

Answers (1)

Remy Lebeau

Reputation: 596256

You have a fundamental misunderstanding of what surrogates are and how they work together. I suggest you read up on how UTF-16 actually works.

Character.is(High|Low)Surrogate() take a UTF-16 codeunit as input. If the result is true, calling Character.(high|low)Surrogate() on the same numeric value doesn't do what you think it does. You can't find a low surrogate from a high surrogate, or vice versa. Any high surrogate can be combined with many different low surrogates (e.g., D800 DC00, D800 DC01, D800 DC02, ...), and any low surrogate can be combined with many different high surrogates (e.g., D800 DC00, D801 DC00, D802 DC00, ...), to produce different Unicode codepoints.
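To make the many-to-many pairing concrete, here is a minimal sketch that decodes a few of the pairs listed above with Character.toCodePoint(). The class name SurrogatePairingDemo and the helper combine() are made up for this illustration.

```java
// Sketch: the same high surrogate with different low surrogates (and the
// same low surrogate with different high surrogates) yields different code points.
final class SurrogatePairingDemo {

    // Hypothetical helper: decode a high+low pair into the code point it encodes.
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid high+low surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // Same high surrogate (D800), two different low surrogates.
        System.out.printf("U+%X%n", combine((char) 0xD800, (char) 0xDC00)); // U+10000
        System.out.printf("U+%X%n", combine((char) 0xD800, (char) 0xDC01)); // U+10001
        // Same low surrogate (DC00), a different high surrogate.
        System.out.printf("U+%X%n", combine((char) 0xD801, (char) 0xDC00)); // U+10400
    }
}
```

Neither half determines the codepoint on its own; changing either one changes the result.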

Character.(high|low)Surrogate() take a Unicode codepoint as input and return the high/low surrogates needed to encode that codepoint in UTF-16, respectively.
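As a rough illustration of the intended use, the sketch below feeds a real supplementary codepoint to both methods and decodes the result again. U+1F600 is an arbitrary example value, and SurrogateEncodeDemo and toSurrogates() are names invented here, not part of the JDK.

```java
// Sketch: highSurrogate()/lowSurrogate() expect a supplementary code point
// (U+10000..U+10FFFF) and return the two UTF-16 code units that encode it.
final class SurrogateEncodeDemo {

    // Hypothetical helper: returns {high, low}; rejects other input because
    // the JDK documents the result as unspecified outside the supplementary range.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        return new char[] {
            Character.highSurrogate(codePoint),
            Character.lowSurrogate(codePoint)
        };
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // arbitrary supplementary code point for illustration
        char[] pair = toSurrogates(codePoint);
        // Prints: U+1F600 -> D83D DE00
        System.out.printf("U+%X -> %04X %04X%n", codePoint, (int) pair[0], (int) pair[1]);
        // Decoding the pair gives back the original code point; prints: true
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == codePoint);
    }
}
```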

You are enumerating codeunits and treating them as if they were codepoints, which they are not (codepoints in the range of UTF-16 surrogates are reserved and thus invalid for use in any context). In other words, you are mixing apples and tangerines, so of course the results are not what you are expecting.

Let's take a deeper look at your code example in context. Let's look at the two extremes that your loop processes, Character.MIN_SURROGATE ('\uD800') and Character.MAX_SURROGATE ('\uDFFF'). The same will be true for all other values in between:

  • Character.isHighSurrogate((char)0xD800) returns true and Character.isLowSurrogate((char)0xD800) returns false, because codeunit D800 is in the range of high surrogate codeunits (D800..DBFF).

  • Character.isHighSurrogate((char)0xDFFF) returns false and Character.isLowSurrogate((char)0xDFFF) returns true, because codeunit DFFF is in the range of low surrogate codeunits (DC00..DFFF).

  • Character.lowSurrogate((int)0xD800) and Character.highSurrogate((int)0xDFFF) both return an unspecified char value, because 0xD800 and 0xDFFF are not valid "supplementary characters" (Unicode codepoints) in the range of Character.MIN_SUPPLEMENTARY_CODE_POINT..Character.MAX_CODE_POINT (U+10000..U+10FFFF), i.e., Character.isSupplementaryCodePoint() returns false for them. Per Java's documentation:

lowSurrogate()

Returns the trailing surrogate (a low surrogate code unit) of the surrogate pair representing the specified supplementary character (Unicode code point) in the UTF-16 encoding. If the specified character is not a supplementary character, an unspecified char is returned.

highSurrogate()

Returns the leading surrogate (a high surrogate code unit) of the surrogate pair representing the specified supplementary character (Unicode code point) in the UTF-16 encoding. If the specified character is not a supplementary character, an unspecified char is returned.

isSupplementaryCodePoint()

Determines whether the specified character (Unicode code point) is in the supplementary character range.

...

Returns:
true if the specified code point is between MIN_SUPPLEMENTARY_CODE_POINT and MAX_CODE_POINT inclusive; false otherwise.

So, generally speaking, if Character.isHighSurrogate() returns true, the result of Character.lowSurrogate() for the same numeric value is unspecified. And if Character.isLowSurrogate() returns true, the result of Character.highSurrogate() for the same numeric value is unspecified.
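One way to see the difference is to loop over supplementary codepoints instead of surrogate codeunits. The following is only a sketch of that idea; SupplementaryLoopDemo and its two methods are hypothetical names.

```java
// Sketch: surrogate code units are never supplementary code points, while
// every supplementary code point encodes to a valid pair that decodes back.
final class SupplementaryLoopDemo {

    // Counts values in D800..DFFF that isSupplementaryCodePoint() accepts.
    static int supplementaryInSurrogateRange() {
        int count = 0;
        for (int cp = Character.MIN_SURROGATE; cp <= Character.MAX_SURROGATE; cp++) {
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
        }
        return count;
    }

    // Checks every code point in U+10000..U+10FFFF: the high/low surrogates
    // must form a valid pair and decode back to the same code point.
    static boolean allSupplementaryRoundTrip() {
        for (int cp = Character.MIN_SUPPLEMENTARY_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            char high = Character.highSurrogate(cp);
            char low = Character.lowSurrogate(cp);
            if (!Character.isSurrogatePair(high, low) || Character.toCodePoint(high, low) != cp) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(supplementaryInSurrogateRange()); // 0
        System.out.println(allSupplementaryRoundTrip());     // true
    }
}
```

The first loop covers the same range as the loop in the question, and none of its values qualifies as input for highSurrogate() or lowSurrogate().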

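Thus, your observations that "all high-surrogate char can find the corresponding low surrogate pair by calling Character.lowSurrogate" and that "Character.isSurrogatePair(c, low) is always true" are not something you can rely on. Character.lowSurrogate() returns an unspecified char for those inputs. If that char happens to fall in the low surrogate range on your JVM, it is an artifact of the implementation, and the resulting "pair" has no meaningful relationship to the value you started with.

Your observation that "the low-surrogate char can not find corresponding high surrogate pair" is correct. That "Character.highSurrogate always return the same char with code point 55287" is likewise just the unspecified value your JVM happens to produce (55287 is 0xD7F7, which is not a surrogate at all), not documented behavior.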

Lastly, regarding your actual question:

Any idea how to construct a valid character given a low surrogate char?

You can't construct a Unicode codepoint from just a high or low surrogate alone. It is the combination of high+low surrogates acting together that defines a specific codepoint. You need both surrogates. Each surrogate carries only 10 of the 20 bits involved, so if you only have one surrogate, you don't have enough bits to reconstruct the codepoint. Period.
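If you do have both halves, building the string is straightforward. The sketch below shows that, and also counts how many codepoints a single low surrogate is compatible with. PairToStringDemo, fromPair(), and candidatesFor() are illustrative names, and D83D DE00 is just an example pair.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: a String for one supplementary character needs both surrogates;
// a lone low surrogate is compatible with 1024 different code points.
final class PairToStringDemo {

    // Hypothetical helper: builds a two-code-unit String from a valid pair.
    static String fromPair(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid high+low surrogate pair");
        }
        return new String(new char[] {high, low});
    }

    // Hypothetical helper: counts the distinct code points obtainable by
    // combining the given low surrogate with every possible high surrogate.
    static int candidatesFor(char low) {
        if (!Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a low surrogate");
        }
        Set<Integer> seen = new HashSet<>();
        for (int h = Character.MIN_HIGH_SURROGATE; h <= Character.MAX_HIGH_SURROGATE; h++) {
            seen.add(Character.toCodePoint((char) h, low));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String s = fromPair((char) 0xD83D, (char) 0xDE00);
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.printf("U+%X%n", s.codePointAt(0));       // U+1F600
        System.out.println(candidatesFor((char) 0xDE00));    // 1024
    }
}
```

Going the other way, Character.toChars(codePoint) or StringBuilder.appendCodePoint(codePoint) produces the pair from a known codepoint.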

Upvotes: 1
