Reputation: 884
I'm using this code to convert a UTF-8 String to binary:
public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        // keep only the last 8 characters: this zero-pads short values and
        // trims the sign-extended 32-bit form produced for negative bytes
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}
Previously, I was using this code:
private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch < 256) {
            // one byte for chars below 256
            result.append(("00000000" + binary).substring(binary.length()));
        } else {
            // two space-separated bytes for everything else
            binary = ("0000000000000000" + binary).substring(binary.length());
            result.append(binary.substring(0, 8));
            result.append(' ');
            result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}
These two methods can return different results; for example:
toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"
I think that's because the bytes of è are negative, while the corresponding char is not (char is a 2-byte unsigned integer).
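A quick check seems to confirm this (a minimal sketch, assuming the java.nio.charset.StandardCharsets import and a surrounding main method):
byte[] utf8 = "è".getBytes(StandardCharsets.UTF_8);
System.out.println(utf8[0]);              // -61 (0xC3 read as a signed byte)
System.out.println(utf8[1]);              // -88 (0xA8 read as a signed byte)
System.out.println((int) "è".charAt(0));  // 232 (0x00E8, an unsigned 16-bit char)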
What I want to know is: which of the two approaches is the correct one and why?
Thanks in advance.
Upvotes: 0
Views: 1336
Reputation: 179
This code snippet might help.
String s = "Some String";
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // name the charset explicitly
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    for (int i = 0; i < 8; i++) { // visit all 8 bits of the byte
        // test the top bit of the low byte, then shift the next bit into place
        binary.append((val & 128) == 0 ? 0 : 1);
        val <<= 1;
    }
}
System.out.println(s + " to binary: " + binary);
Upvotes: 0
Reputation: 308041
Whenever you want to convert text into binary data (or into text representing binary data, as you do here), you have to use some encoding.
Your toBinary uses UTF-8 for that encoding.
Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit* below 256 in a single byte and all others in 2 bytes. Unfortunately, that is not a useful encoding: to decode it you would have to know whether a given byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 solve this by signaling with the high-order bits which one it is).
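You can see the ambiguity directly (a minimal sketch that assumes the toBinary2 method from the question is in scope; Ā is U+0100, i.e. decimal 256):
System.out.println(toBinary2("\u0100"));       // 00000001 00000000 (one char, 2-byte branch)
System.out.println(toBinary2("\u0001\u0000")); // 00000001 00000000 (two chars, 1-byte branch)
Two different input strings yield exactly the same output, so a decoder has no way to recover the original.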
tl;dr: toBinary seems correct; toBinary2 will produce output that can't be uniquely decoded back to the original string.
* You might be wondering where the mention of UTF-16 comes from: all String objects in Java are implicitly encoded in UTF-16, so charAt gives you UTF-16 code units (which just so happen to be equal to the Unicode code point for every character that fits into the Basic Multilingual Plane).
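That footnote is easy to check (a minimal sketch; the emoji is just an arbitrary character outside the Basic Multilingual Plane):
String bmp = "è";     // U+00E8, inside the BMP
String astral = "😀"; // U+1F600, outside the BMP
System.out.println((int) bmp.charAt(0));                    // 232: the code unit equals the code point
System.out.println(astral.length());                        // 2: stored as a surrogate pair
System.out.println(Integer.toHexString(astral.charAt(0)));  // d83d (high surrogate)
System.out.println(Integer.toHexString(astral.charAt(1)));  // de00 (low surrogate)
System.out.println(astral.codePointAt(0));                  // 128512 (0x1F600)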
Upvotes: 1