Java - What is the proper way to convert a UTF-8 String to binary?

Question

I'm using this code to convert a UTF-8 String to binary:

public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}

Before I was using this code:

private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch < 256) {
            result.append(("00000000" + binary).substring(binary.length()));
        } else {
            binary = ("0000000000000000" + binary).substring(binary.length());
            result.append(binary.substring(0, 8));
            result.append(' ');
            result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}

These two methods can return different results; for example:

toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"

I think that is because the bytes of è are negative while the corresponding char is not (because char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one and why?
Thanks in advance.

Joachim Sauer · Accepted Answer

Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.

Your toBinary uses UTF-8 for that encoding.
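
To illustrate why a standard encoding matters, here is a minimal sketch of a round trip: the binary text produced with UTF-8 can be turned back into the original string. The class name Utf8BinaryRoundTrip and the helper fromBinary are made up for this example and are not part of the question; toBinary is a lightly adapted copy of the asker's method (it masks each byte with 0xFF instead of relying on the padding trick).

```java
import java.nio.charset.StandardCharsets;

class Utf8BinaryRoundTrip {
    // Adapted from the question: one 8-bit group per UTF-8 byte.
    static String toBinary(String str) {
        byte[] buf = str.getBytes(StandardCharsets.UTF_8);
        StringBuilder result = new StringBuilder();
        for (byte b : buf) {
            // Mask with 0xFF so negative bytes map to 0..255.
            String binary = Integer.toBinaryString(b & 0xFF);
            result.append(("00000000" + binary).substring(binary.length()));
            result.append(' ');
        }
        return result.toString().trim();
    }

    // Hypothetical inverse: parse each 8-bit group and decode the bytes as UTF-8.
    static String fromBinary(String binary) {
        if (binary.isEmpty()) {
            return "";
        }
        String[] groups = binary.split(" ");
        byte[] buf = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            buf[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String bits = toBinary("\u00e8");              // "è"
        System.out.println(bits);                      // 11000011 10101000
        System.out.println(fromBinary(bits).equals("\u00e8")); // true
    }
}
```

Decoding works here because UTF-8 itself marks which bytes start a sequence and which continue one, so no extra information is needed.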

Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit* below 256 in a single byte and all others in 2 bytes. Unfortunately, that is not a useful encoding, since for decoding you would have to know whether a single byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 solve this by using the highest bits to indicate which one it is).
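
The ambiguity can be shown with a small sketch. The class name AmbiguityDemo and the two sample strings are chosen for illustration only; toBinary2 is copied from the question. Two different strings end up with the same output, so a decoder cannot tell them apart.

```java
class AmbiguityDemo {
    // Copy of toBinary2 from the question.
    static String toBinary2(String str) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int ch = str.charAt(i);
            String binary = Integer.toBinaryString(ch);
            if (ch < 256) {
                result.append(("00000000" + binary).substring(binary.length()));
            } else {
                binary = ("0000000000000000" + binary).substring(binary.length());
                result.append(binary.substring(0, 8));
                result.append(' ');
                result.append(binary.substring(8));
            }
            result.append(' ');
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        String oneChar = "\u0141";    // "Ł", a single char with value 0x0141
        String twoChars = "\u0001A";  // two chars: U+0001 followed by 'A' (0x41)

        System.out.println(toBinary2(oneChar));   // 00000001 01000001
        System.out.println(toBinary2(twoChars));  // 00000001 01000001
    }
}
```

Both calls print the same two groups, so the output alone does not say whether it stands for one 2-byte character or two 1-byte characters.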

tl;dr toBinary seems correct, toBinary2 will produce output that can't uniquely be decoded back to the original string.

* You might be wondering where the mention of UTF-16 comes from: all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 code units (which happen to be equal to the Unicode code point for all characters in the Basic Multilingual Plane).
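
As a rough illustration of the footnote, the following sketch compares what charAt and codePointAt return for a character inside the Basic Multilingual Plane and for one outside it. The class name CodeUnitDemo and the helper hexUnits are hypothetical names for this example; the String methods used are standard.

```java
class CodeUnitDemo {
    // Hypothetical helper: the value of every char (UTF-16 code unit) in hex.
    static String hexUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = "\u00e8";           // "è", inside the Basic Multilingual Plane
        String astral = "\uD83D\uDE00";  // U+1F600, outside the BMP

        // Inside the BMP the single code unit equals the code point.
        System.out.println(hexUnits(bmp));                          // e8
        System.out.println(Integer.toHexString(bmp.codePointAt(0))); // e8

        // Outside the BMP one code point is stored as two code units.
        System.out.println(astral.length());                           // 2
        System.out.println(hexUnits(astral));                          // d83d de00
        System.out.println(Integer.toHexString(astral.codePointAt(0))); // 1f600
    }
}
```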
