Margarita Spasskaya

Reputation: 663

Java convert unicode code point to string

How can a UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted to a string in Java?

I have tried something like:

Character.toCodePoint((char) Integer.parseInt("D0", 16), (char) Integer.parseInt("93", 16));

but it does not convert to a valid code point.

Upvotes: 2

Views: 4847

Answers (2)

Andreas

Reputation: 159086

That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).
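To illustrate why the bytes have to go through a UTF-8 decoder, here is a minimal sketch that decodes the first two bytes of the example (0xD0 0x93) by hand and compares the result with what new String produces. The class and method names are hypothetical, and the manual decoder covers only the two-byte form with no validation, so treat it as an illustration rather than something to use in place of the library call.

```java
import java.nio.charset.StandardCharsets;

class TwoByteDecodeSketch {
    // Hypothetical helper: decodes one two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
    // Illustration only: no validation, and 1-, 3- and 4-byte sequences are not handled.
    static int decodeTwoByteSequence(int lead, int continuation) {
        return ((lead & 0x1F) << 6) | (continuation & 0x3F);
    }

    public static void main(String[] args) {
        int manual = decodeTwoByteSequence(0xD0, 0x93);

        // The library route: hand the raw bytes to the UTF-8 decoder.
        String decoded = new String(new byte[] {(byte) 0xD0, (byte) 0x93}, StandardCharsets.UTF_8);

        System.out.println(Integer.toHexString(manual));      // 413
        System.out.println(decoded.codePointAt(0) == manual); // true

        // Character.toCodePoint expects a UTF-16 surrogate pair, and 0xD0 is not a surrogate.
        System.out.println(Character.isHighSurrogate((char) 0xD0)); // false
    }
}
```

The last line shows why the attempt in the question cannot work: Character.toCodePoint combines two UTF-16 surrogates, while 0xD0 and 0x93 are UTF-8 bytes.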

Update

Here is a slightly more direct way to decode the string than the one provided by "sstan" in another answer. Both versions work, so use whichever you are more comfortable with, or write your own.

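import java.nio.charset.StandardCharsets;

String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

// Every byte is written as '=' plus two hex digits, so the length must be a multiple of 3.
// Note: assert statements only run when the JVM is started with -ea.
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j += 3) {
    assert src.charAt(j) == '=';
    // Combine the high and low hex digits into one byte.
    bytes[i] = (byte) (Character.digit(src.charAt(j + 1), 16) << 4 |
                       Character.digit(src.charAt(j + 2), 16));
}
// Let the library decode the UTF-8 bytes.
String str = new String(bytes, StandardCharsets.UTF_8);

System.out.println(str);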

Output

Газета
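If you do write your own version, one possible variation is a decoder that also accepts literal ASCII characters between the =XX escapes and reports malformed input instead of relying on assert. This is only a sketch: the question shows fully escaped input, so mixed input is an assumption, and the class name and decode method are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class HexEscapeDecoderSketch {
    // Hypothetical helper: turns "=XX" escapes into bytes and copies literal
    // ASCII characters through unchanged, then decodes everything as UTF-8.
    static String decode(String src) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '=') {
                // An escape needs two more characters after the '='.
                if (i + 2 >= src.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(src.charAt(i + 1), 16);
                int lo = Character.digit(src.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid hex digits at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Assumption: anything outside an escape is plain ASCII.
                if (c > 0x7F) {
                    throw new IllegalArgumentException("Non-ASCII literal at index " + i);
                }
                out.write(c);
                i++;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Prints the same word as the output above.
        System.out.println(decode("=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
    }
}
```

Collecting the bytes first and decoding once at the end matters here: decoding byte by byte would split the multi-byte UTF-8 sequences.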

Upvotes: 4

sstan

Reputation: 36483

In UTF-8, a single character is not always encoded with the same number of bytes. Depending on the character, it may take 1, 2, 3, or even 4 bytes. It is therefore not trivial to map UTF-8 bytes yourself to Java chars, which use UTF-16, where each char is a 2-byte code unit. On top of that, characters with code points above 0xFFFF need a surrogate pair (two chars), which is one more complication that is easy to get wrong.
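To make the variable-length point concrete, the following sketch prints how many UTF-8 bytes a few sample characters need, and shows that a code point above 0xFFFF occupies two Java chars. The sample characters and the utf8Length helper are chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Hypothetical helper: number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "A";                                    // U+0041
        String cyrillic = "\u0413";                            // U+0413
        String euro = "\u20AC";                                // U+20AC
        String emoji = new String(Character.toChars(0x1F600)); // U+1F600, above 0xFFFF

        System.out.println(utf8Length(ascii));    // 1
        System.out.println(utf8Length(cyrillic)); // 2
        System.out.println(utf8Length(euro));     // 3
        System.out.println(utf8Length(emoji));    // 4

        // In UTF-16 the emoji is a surrogate pair: two chars, one code point.
        System.out.println(emoji.length());                             // 2
        System.out.println(emoji.codePointCount(0, emoji.length()));    // 1
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
    }
}
```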

All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.

Here is some sample code that shows one way this can be achieved:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

        // Parse string into hex string tokens, dropping the empty token before the first '='.
        String[] tokens = Arrays.stream(src.split("="))
                .filter(s -> s.length() != 0)
                .toArray(String[]::new);

        // Convert the hex string representations to a byte array.
        byte[] utf8bytes = new byte[tokens.length];
        for (int i = 0; i < utf8bytes.length; i++) {
            utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
        }

        // Convert UTF-8 bytes to Java String.
        String str = new String(utf8bytes, StandardCharsets.UTF_8);

        // Display string + individual Unicode code points.
        System.out.println(str);
        str.codePoints().forEach(System.out::println);
    }
}

Output:

Газета
1043
1072
1079
1077
1090
1072

Upvotes: 1
