Reputation: 4534
I was playing around with String and its constructor and noticed some behaviour I can't explain.
I created the following method:
public static String negate(String s) {
    byte[] b = s.getBytes();
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return new String(b);
}
which simply takes the two's complement of each byte and returns a new String built from the result. When calling it like
System.out.println(negate("Hello"));
I got an output of
[-72, -101, -108, -108, -111]
�����
which I guess is fine, since there are no negative ASCII values.
But when I nested the calls like so
System.out.println(negate(negate("Hello")));
my output was like this
[-72, -101, -108, -108, -111]
[17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67]
ACACACACAC // 5 groups of 3 characters (1 ctrl-char and "AC")
I expected the output to exactly match my input string "Hello", but instead I got this. Why? This also happens with every other input string. After nesting, each single character from the input becomes just AC.
I went further and created a method that does the same thing, but only with raw byte arrays:
public static byte[] n(byte[] b) {
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return b;
}
Here the output is as expected. For
System.out.println(new String(n(n("Hello".getBytes()))));
I get
[-72, -101, -108, -108, -111]
[72, 101, 108, 108, 111]
Hello
So I guess it has to do with the way Strings are created, since it only happened when I called negate with an instance that had already been built from the negative bytes?
I even walked down the class tree to look at the internal classes but I couldn't find where this behaviour comes from.
Also, in the docs of String there's the following paragraph, which might be an explanation:
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified
Can anybody tell me why it's like this and what exactly is happening here?
Upvotes: 3
Views: 196
Reputation: 1074495
The issue is that you're taking the negated bytes and trying to interpret them as a valid byte stream in the default character set (remember, characters are not bytes). So, as the String constructor docs you quoted tell you, the result is unspecified, and it probably involves some form of error handling, such as substituting or dropping invalid values. Naturally, then, it's a lossy process, and reversing it will not get you back your original string.
If you get the bytes and double-negate them without converting the intermediate bytes to string, you'll get back your original result.
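If you do need the intermediate String, one option is to name a charset in which every byte value is valid. The sketch below assumes ISO-8859-1, which maps each of the 256 byte values to exactly one char and back; the class name and the negateBytes helper are made up for illustration, and this only round-trips input whose characters all fit in ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {
    // Two's complement negation of every byte, returned as a new array.
    static byte[] negateBytes(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) -in[i];
        }
        return out;
    }

    // Same idea as negate(String) from the question, but with an explicit
    // charset in which no byte sequence is invalid (an assumption of this
    // sketch, not something the question's code does).
    static String negate(String s) {
        byte[] b = negateBytes(s.getBytes(StandardCharsets.ISO_8859_1));
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(negate(negate("Hello"))); // prints Hello
    }
}
```

The intermediate string is still unreadable; it just no longer loses information on the way through.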
This example demonstrates the lossy nature of new String(/*invalid bytes*/):
String s = "Hello";
byte[] b = s.getBytes();
for (int i = 0; i < b.length; i++) {
    b[i] = (byte)(~b[i] + 1);
}
// Show the negated bytes
System.out.println(Arrays.toString(b));
String s2 = new String(b);
// Show the bytes of the string constructed from them; note they're not the same
System.out.println(Arrays.toString(s2.getBytes()));
On my system, which I believe defaults to UTF-8, I get:
[-72, -101, -108, -108, -111]
[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
Note what happened when I took the invalid byte stream, made a string out of it, and then got the bytes of that string.
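If you would rather have invalid input rejected than silently altered, a CharsetDecoder can be asked to report errors. This is a minimal sketch assuming UTF-8; the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true if the bytes decode cleanly as UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The negated bytes of "Hello" from the output above.
        byte[] negated = {-72, -101, -108, -108, -111};
        System.out.println(isValidUtf8("Hello".getBytes(StandardCharsets.UTF_8))); // prints true
        System.out.println(isValidUtf8(negated)); // prints false
    }
}
```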
Upvotes: 4
Reputation: 73558
You "negate" a character and it becomes invalid. Then you get the placeholder � (U+FFFD). At this point everything is corrupted. Then you "negate" that, and you get your AC from each of the placeholder chars.
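To see where the AC comes from, here is a small sketch that assumes the default charset is UTF-8 (the class and method names are made up for illustration). U+FFFD encodes to three bytes, and negating them gives 17 (a control character), 65 ('A') and 67 ('C'):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharBytes {
    // UTF-8 bytes of the replacement character U+FFFD.
    static byte[] replacementBytes() {
        return "\uFFFD".getBytes(StandardCharsets.UTF_8);
    }

    // The same bytes after the two's complement negation from the question.
    static byte[] negatedReplacementBytes() {
        byte[] b = replacementBytes();
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) (~b[i] + 1);
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replacementBytes()));        // prints [-17, -65, -67]
        System.out.println(Arrays.toString(negatedReplacementBytes())); // prints [17, 65, 67]
    }
}
```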
Upvotes: 2