Shaw

Reputation: 105

String Encoding with Emoji in Java?

I have a small test example like this:

    import java.nio.charset.StandardCharsets;

    public class Main {
        public static void main(String[] args) {
            String s = "🇻🇺";
            System.out.println(s);
            System.out.println(s.length());
            System.out.println(s.toCharArray().length);
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
            System.out.println(s.getBytes(StandardCharsets.UTF_16).length);
            System.out.println(s.codePointCount(0, s.length()));
            System.out.println(Character.codePointCount(s, 0, s.length()));
        }
    }

And result is:

🇻🇺
4
4
8
10
2
2

I cannot understand why one Unicode character, the Vanuatu flag, returns a length of 4, takes 8 bytes in UTF-8, and takes 10 bytes in UTF-16. I know Java uses UTF-16 internally and that it needs one char (2 bytes) for one code point, so I expected the flag to need 2 chars, but the result is 4. Can someone explain this fully? Many thanks.

Upvotes: 3

Views: 2041

Answers (2)

user14387228

Reputation: 381

UTF-8 is a variable-length encoding that uses 1 to 4 bytes per Unicode code point. The first byte carries from 3 to 7 bits of the code point, and each subsequent byte carries 6 bits, so a sequence carries from 7 to 21 bits of payload.

The number of bytes needed depends on the particular character.

See the Wikipedia page on UTF-8 for the details of the encoding.
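
To see this concretely, here is a minimal sketch (the class name Utf8Bytes is just for illustration) that dumps the UTF-8 bytes of the flag string from the question:

    import java.nio.charset.StandardCharsets;

    public class Utf8Bytes {
        public static void main(String[] args) {
            String flag = "🇻🇺"; // U+1F1FB U+1F1FA
            for (byte b : flag.getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
            // Prints: F0 9F 87 BB F0 9F 87 BA
            // Both code points are above U+FFFF, so each needs the
            // 4-byte UTF-8 form: 2 code points x 4 bytes = 8 bytes.
        }
    }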

UTF-16 uses either one 16-bit unit or two 16-bit units per Unicode code point. Roughly speaking, code points in the first 64K (the Basic Multilingual Plane) are encoded as one unit; code points outside that range need two units, called a surrogate pair.

"Roughly" because, strictly, the code points that fit in one 16-bit unit are those in U+0000 to U+D7FF or U+E000 to U+FFFF. The values in between (U+D800 to U+DFFF) are reserved for the two-unit format.

The number of 16-bit units needed depends on the particular character. Note also that String.getBytes(StandardCharsets.UTF_16) prepends a two-byte byte-order mark (FE FF), which is why your flag string comes out as 10 bytes rather than 8.

See the Wikipedia page on UTF-16 for the details.
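
A similar sketch (again, the class name Utf16Units is just illustrative) prints the individual char values and the encoded UTF-16 bytes, including the byte-order mark mentioned above:

    import java.nio.charset.StandardCharsets;

    public class Utf16Units {
        public static void main(String[] args) {
            String flag = "🇻🇺";
            for (int i = 0; i < flag.length(); i++) {
                System.out.printf("%04X ", (int) flag.charAt(i));
            }
            System.out.println();
            // Prints: D83C DDFB D83C DDFA -- two surrogate pairs

            for (byte b : flag.getBytes(StandardCharsets.UTF_16)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
            // Prints: FE FF D8 3C DD FB D8 3C DD FA
            // (BOM + two surrogate pairs = 2 + 8 = 10 bytes)
        }
    }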

Upvotes: 1

that other guy

Reputation: 123400

Unicode flag emojis are encoded as two code points.

There are 26 Regional Indicator Symbols representing A-Z, and a flag is encoded by spelling out the ISO country code. For example, the Vanuatu flag is encoded as "VU", and the American flag is "US".

The indicators all live outside the Basic Multilingual Plane (they are in the supplementary plane), so each one requires two UTF-16 code units, i.e. a surrogate pair. This brings the total up to 4 Java chars per flag.
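
As a quick check, here is a minimal sketch (the class name FlagFromCode is made up for illustration) that builds the flag from its country code and lists the resulting code points:

    public class FlagFromCode {
        public static void main(String[] args) {
            // REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6; offset
            // each ASCII letter from 'A' to get its indicator.
            String country = "VU"; // Vanuatu
            StringBuilder sb = new StringBuilder();
            for (char c : country.toCharArray()) {
                sb.appendCodePoint(0x1F1E6 + (c - 'A'));
            }
            String flag = sb.toString();
            System.out.println(flag);          // 🇻🇺
            System.out.println(flag.length()); // 4 (two surrogate pairs)
            flag.codePoints()
                .forEach(cp -> System.out.printf("U+%X%n", cp));
            // Prints: U+1F1FB and U+1F1FA
        }
    }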

The purpose of this scheme is to avoid having to update the standard whenever a country gains or loses independence, and it helps the Unicode Consortium stay neutral, since it doesn't have to be an arbiter of geopolitical claims.

Upvotes: 5
