Kuldeep Singh

Reputation: 29

String byte encoding issue

Given the following function:

static void fun(String str) {
    System.out.println(String.format(
            "%s | length in String: %d | length in bytes: %d | bytes: %s",
            str, str.length(), str.getBytes().length, Arrays.toString(str.getBytes())));
}

Invoking fun("ó"); produces this output:

ó | length in String: 1 | length in bytes: 2 | bytes: [-61, -77]

So the character ó needs 2 bytes to be represented, and according to the Character class documentation the default in Java is UTF-16. With that in mind, when I run the following:

System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.ISO_8859_1));// output=Ã³
System.out.println(new String("ó".getBytes(), StandardCharsets.US_ASCII));// output=��
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_8));// output=ó
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16BE));// output=쎳
System.out.println(new String("ó".getBytes(), StandardCharsets.UTF_16LE));// output=돃

Why is none of the UTF_16, UTF_16BE, or UTF_16LE charsets able to decode the bytes properly, given that the bytes represent a 16-bit character? And how is UTF-8 able to decode them properly, given that UTF-8 treats each character as only 8 bits long? I would have expected it to print 2 chars (1 char for each byte), as ISO_8859_1 does.

Upvotes: 0

Views: 436

Answers (1)

Sweeper

Reputation: 270758

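getBytes() with no arguments always returns the bytes encoded in the platform's default charset, which is probably UTF-8 for you (from Java 18 onward, UTF-8 is the default unless it is overridden). The documentation says: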

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

So you are essentially trying to decode a bunch of UTF-8 bytes with non-UTF-8 charsets. No wonder you don't get the expected results: the two bytes [-61, -77] are the UTF-8 encoding of ó, not its UTF-16 encoding.
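To make the mismatch concrete, here is a minimal sketch that prints the bytes in hex and decodes the UTF-8 bytes with the other charsets. The class name `MismatchDemo` and the helpers `hex` and `recode` are made up for this illustration; the string literal is written as `\u00F3` so the result does not depend on the source file's encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    // Formats bytes as unsigned two-digit hex values, e.g. "C3 B3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the string with one charset, then decodes the bytes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u00F3"; // ó, code point U+00F3

        // The same character has different byte sequences in different encodings.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // C3 B3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 F3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // F3 00

        // Reading the UTF-8 bytes C3 B3 as one UTF-16 code unit gives an unrelated character.
        String asBE = recode(s, StandardCharsets.UTF_8, StandardCharsets.UTF_16BE);
        String asLE = recode(s, StandardCharsets.UTF_8, StandardCharsets.UTF_16LE);
        System.out.println(Integer.toHexString(asBE.charAt(0))); // c3b3
        System.out.println(Integer.toHexString(asLE.charAt(0))); // b3c3

        // ISO-8859-1 maps every byte to one character, so two bytes become two chars.
        String asLatin1 = recode(s, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.length()); // 2
    }
}
```

U+C3B3 and U+B3C3 are the characters 쎳 and 돃 from the question, which is why the UTF-16 decoders appear to print random Hangul syllables.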

Though kind of pointless, you can get what you want by passing the desired charset to getBytes, so that the string is encoded correctly.

    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_16));
    System.out.println(new String("ó".getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.ISO_8859_1));
    System.out.println(new String("ó".getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE));
    System.out.println(new String("ó".getBytes(StandardCharsets.UTF_16LE), StandardCharsets.UTF_16LE));
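One caveat worth checking: a round trip only restores the original string if the charset can represent the character at all. The sketch below tests each charset from the snippet above; `RoundTripDemo` and `roundTrip` are names invented for this example. US_ASCII has no ó, and `getBytes(Charset)` substitutes the charset's replacement byte for unmappable characters, so that line yields `?` instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encodes and decodes with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String s = "\u00F3"; // ó
        Charset[] charsets = {
            StandardCharsets.UTF_16, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE, StandardCharsets.UTF_16LE
        };
        for (Charset cs : charsets) {
            // Prints true for every charset except US-ASCII.
            System.out.println(cs + " preserves the string: " + roundTrip(s, cs).equals(s));
        }
        // UTF_16 without an explicit byte order writes a byte-order mark first,
        // so the encoded form is 4 bytes (FE FF 00 F3) rather than 2.
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length); // 4
    }
}
```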

You also seem to have some misunderstanding about encodings. It's not just about the number of bytes that a character takes. The byte-count-per-character for two encodings being the same doesn't mean that they are compatible with each other. Also, it is not always one byte per character in UTF-8. UTF-8 is a variable-length encoding.
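As a rough illustration of the variable-length point, this sketch compares encoded lengths for a few characters. The class and method names (`Utf8LengthDemo`, `utf8Length`, `utf16Length`) are assumptions made for the example, and the characters are written as Unicode escapes to keep the source file ASCII-only.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as UTF-16 (big-endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041
            "\u00F3",       // U+00F3 (ó)
            "\u20AC",       // U+20AC (euro sign)
            "\uD83D\uDE00"  // U+1F600, stored as a surrogate pair in a Java String
        };
        for (String s : samples) {
            System.out.println(utf8Length(s) + " bytes in UTF-8, "
                    + utf16Length(s) + " bytes in UTF-16");
        }
        // UTF-8 lengths:  1, 2, 3, 4
        // UTF-16 lengths: 2, 2, 2, 4
    }
}
```

Here ó takes 2 bytes in both UTF-8 and UTF-16, but the byte values differ, so equal length says nothing about compatibility.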

Upvotes: 5
