Eugene

Reputation: 121028

What is the number of bytes that letter A will occupy in UTF-16?

I have a really simple question: what is the minimum size (number of bytes) that the letter "A" occupies in UTF-16 encoding? I'm coding in Java, though that should be irrelevant.

I really thought this one was pretty simple: since UTF-16 uses either 2 or 4 bytes per character, and the letter A is a "simple" one, the answer should be two. But then:

System.out.println("A".getBytes(StandardCharsets.UTF_8).length); // prints 1, as expected
System.out.println("A".getBytes(StandardCharsets.UTF_16).length); // prints 4, I thought it would be 2

System.out.println("AB".getBytes(StandardCharsets.UTF_8).length); // prints 2, as expected
System.out.println("AB".getBytes(StandardCharsets.UTF_16).length); // prints 6, expected 4

Can someone shed some light on this?

Upvotes: 1

Views: 87

Answers (1)

Peter Lawrey

Reputation: 533820

UTF-16 needs to specify whether its bytes are little-endian or big-endian. It does this with a BOM (byte order mark), the character \uFEFF, which adds two extra bytes at the start: FE FF for big-endian, or FF FE for little-endian.

UTF-8 is defined as a sequence of single bytes, so there is no byte order to keep track of.

If you use UTF-16BE or UTF-16LE, the charset name itself defines the byte order, so no BOM is needed and "A" encodes to just 2 bytes.
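
To see where the extra bytes come from, it can help to print the encoded bytes rather than just their count. The following is a minimal sketch, not part of the original answer; the class name Utf16Sizes and the helper hex are made up for illustration, and the expected output in the comments assumes the standard JDK behaviour of StandardCharsets.UTF_16 writing a big-endian BOM when encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: dumps the bytes Java produces for "A"
// under the three UTF-16 charsets in StandardCharsets.
class Utf16Sizes {

    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // BOM (FE FF) followed by the big-endian code unit for 'A'
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // Byte order fixed by the charset name, so no BOM is written
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```

This also accounts for the 6 bytes seen for "AB": 2 bytes of BOM plus 2 bytes for each of the two letters.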

Upvotes: 2
