Aura Lee

Reputation: 466

Java char to bytes and back is converted wrongly (UTF-8)

While programming, I ran into odd behavior when a String is converted to bytes and then back to a String. Some chars come back changed, so the hashCode of the String changes too, although the length of the String remains the same. The problem seems to affect chars in the range 55296 to 57343 (U+D800 to U+DFFF); other chars work fine. Is it because they are surrogates?

String string = new String(new char[] { 56000 });
System.out.println((int)string.charAt(0));
System.out.println((int)new String(string.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8).charAt(0));

The console output is:

56000
63

What is going on here? Is this a Java bug, or am I misunderstanding something?

Upvotes: 2

Views: 250

Answers (1)

Henry

Reputation: 43738

That's because these values are not characters but surrogates. A high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF) forms a surrogate pair, which in turn represents one character. A single high or low surrogate on its own is an invalid encoding and not a character.
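To make the pairing rule concrete, here is a minimal sketch using the standard java.lang.Character methods. The class name SurrogateDemo and the helper hasUnpairedSurrogate are made up for this illustration; U+1F600 is just an arbitrary character outside the Basic Multilingual Plane.

```java
class SurrogateDemo {
    // Hypothetical helper: true if s contains a high surrogate that is not
    // followed by a low surrogate, or a low surrogate with no high one before it.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip its low half
                } else {
                    return true; // high surrogate without a partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate without a preceding high one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        char lone = 56000; // 0xDAC0, the value from the question
        System.out.println(Character.isHighSurrogate(lone)); // true

        // U+1F600 needs two chars (a surrogate pair) in UTF-16
        String pair = new String(Character.toChars(0x1F600));
        System.out.println(pair.length());                        // 2
        System.out.println(pair.codePointCount(0, pair.length())); // 1

        System.out.println(hasUnpairedSurrogate(String.valueOf(lone))); // true
        System.out.println(hasUnpairedSurrogate(pair));                 // false
    }
}
```

So 56000 is only the first half of a pair; by itself it does not stand for any character.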

Since this is an invalid encoding, it is replaced by a "?" character when you convert it to UTF-8. That is the 63 in your output: 63 is the character code of "?".
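If you would rather detect this than have it silently replaced, one option is a CharsetEncoder set to report malformed input. The following is only a sketch; the class name StrictUtf8 and the helpers roundTrip and isEncodable are names invented for this example.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Lenient round trip, same conversion as in the question.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: true if s encodes to UTF-8 without any replacement.
    static boolean isEncodable(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. an unpaired surrogate
        }
    }

    public static void main(String[] args) {
        String lone = new String(new char[] { 56000 });
        String pair = new String(Character.toChars(0x1F600));

        System.out.println((int) roundTrip(lone).charAt(0)); // 63, i.e. '?'
        System.out.println(roundTrip(pair).equals(pair));    // true
        System.out.println(isEncodable(lone));               // false
        System.out.println(isEncodable(pair));               // true
    }
}
```

A complete surrogate pair survives the round trip unchanged; only the unpaired surrogate is lost.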

You can read more about it here, for example: https://en.wikipedia.org/wiki/UTF-16

Upvotes: 1
