dems98

Reputation: 884

Java - What is the proper way to convert a UTF-8 String to binary?

I'm using this code to convert a UTF-8 String to binary:

public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}

Before I was using this code:

private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch<256)
           result.append(("00000000" + binary).substring(binary.length()));
        else {
           binary = ("0000000000000000" + binary).substring(binary.length());
           result.append(binary.substring(0, 8));
           result.append(' ');
           result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}

These two methods can return different results; for example:

toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"

I think that is because the bytes of è are negative while the corresponding char is not (because char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one and why?
Thanks in advance.

Upvotes: 0

Views: 1336

Answers (2)

IVAN PAUL

Reputation: 179

This code snippet might help.

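// Requires: import java.nio.charset.StandardCharsets;
String s = "Some String";
// Name the charset explicitly; getBytes() without an argument uses the platform default.
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    // Emit the 8 bits of this byte, most significant bit first.
    for (int i = 0; i < 8; i++) {
        binary.append((val & 128) == 0 ? 0 : 1);
        val <<= 1;
    }
}
System.out.println("'" + s + "' to binary: " + binary);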

Upvotes: 0

Joachim Sauer

Reputation: 308041

Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.

Your toBinary uses UTF-8 for that encoding.

Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit* below 256 in a single byte and all others in 2 bytes. Unfortunately, that is not a useful encoding, because when decoding you have to know whether a single byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 solve this by marking, in the highest bits, which case it is).
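
To make the ambiguity concrete, here is a small sketch. The class name AmbiguityDemo and the two sample strings are my own choices for illustration; toBinary2 is copied from the question with the same logic, only made static. It shows two different strings that end up with the same output:

```java
class AmbiguityDemo {
    // Same logic as toBinary2 from the question, made static for this demo.
    static String toBinary2(String str) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            int ch = str.charAt(i);
            String binary = Integer.toBinaryString(ch);
            if (ch < 256) {
                // One group of 8 bits for small values.
                result.append(("00000000" + binary).substring(binary.length()));
            } else {
                // Two groups of 8 bits (high byte, then low byte) for everything else.
                binary = ("0000000000000000" + binary).substring(binary.length());
                result.append(binary, 0, 8).append(' ').append(binary.substring(8));
            }
            result.append(' ');
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        String oneChar = "\u0141";   // a single char with value 0x0141
        String twoChars = "\u0001A"; // two chars: U+0001 followed by 'A' (0x41)
        System.out.println(toBinary2(oneChar));
        System.out.println(toBinary2(twoChars));
        // Both lines are identical, so a decoder cannot tell the inputs apart.
        System.out.println(toBinary2(oneChar).equals(toBinary2(twoChars)));
    }
}
```

Both calls produce "00000001 01000001", so nothing in the output says whether it came from one char or from two.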

tl;dr toBinary seems correct, toBinary2 will produce output that can't uniquely be decoded back to the original string.
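
As a rough check that the UTF-8 variant can be decoded, here is a sketch of a round trip. The class name Utf8RoundTrip and the fromBinary helper are hypothetical and not part of the question; toBinary follows the question's approach, except that I mask each byte with 0xFF so the padding does not rely on sign extension:

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // UTF-8 encode, then print each byte as 8 binary digits separated by spaces.
    static String toBinary(String str) {
        byte[] buf = str.getBytes(StandardCharsets.UTF_8);
        StringBuilder result = new StringBuilder();
        for (byte b : buf) {
            // Mask with 0xFF so the byte is treated as an unsigned value 0..255.
            String binary = Integer.toBinaryString(b & 0xFF);
            result.append(("00000000" + binary).substring(binary.length())).append(' ');
        }
        return result.toString().trim();
    }

    // Hypothetical inverse: parse each 8-bit group back to a byte, then UTF-8 decode.
    static String fromBinary(String bits) {
        if (bits.isEmpty()) {
            return "";
        }
        String[] groups = bits.split(" ");
        byte[] buf = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            buf[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "h\u00e8llo"; // contains è (U+00E8)
        String bits = toBinary(original);
        System.out.println(bits);
        System.out.println(fromBinary(bits).equals(original));
    }
}
```

The decoder needs no extra information here, because UTF-8 itself marks which bytes start a sequence and which continue one.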

* You might be wondering where the mention of UTF-16 comes from: that's because all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 code units (which just so happen to be equal to the Unicode code point for all characters that fit into the Basic Multilingual Plane).
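
A short sketch of that footnote, assuming nothing beyond the standard String API (the class name CodeUnitDemo and the describe helper are made up for this example). It compares a character inside the Basic Multilingual Plane with one outside it:

```java
class CodeUnitDemo {
    // Reports how many char values and how many Unicode code points a string holds.
    static String describe(String s) {
        return "codeUnits=" + s.length()
                + " codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u00e8";          // è, inside the Basic Multilingual Plane
        String astral = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair

        // Inside the BMP, charAt and codePointAt agree.
        System.out.println(describe(bmp));
        System.out.println((int) bmp.charAt(0) == bmp.codePointAt(0));

        // Outside the BMP, charAt(0) is only the first half of the pair.
        System.out.println(describe(astral));
        System.out.println(Integer.toHexString(astral.charAt(0)));
        System.out.println(Integer.toHexString(astral.codePointAt(0)));
    }
}
```

For the second string, charAt(0) gives d83d while codePointAt(0) gives 1f600, which is why a charAt-based loop does not see whole characters outside the BMP.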

Upvotes: 1
