QBrute

Reputation: 4534

"Negating" a String gives unexpected behaviour

I was playing around with String and its constructor and noticed some behaviour I can't explain.

I created the following method

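public static String negate(String s) {
    // Encode using the platform default charset
    byte[] b = s.getBytes();
    for (int i = 0; i < b.length; i++) {
        // Two's complement: ~x + 1 == -x
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    // Decode using the platform default charset
    return new String(b);
}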

which simply takes the two's complement of each byte and returns a new String built from the result. When calling it like

System.out.println(negate("Hello"));

I got an output of

[-72, -101, -108, -108, -111]
�����

which I guess is fine, since there are no negative ASCII values.
But when I nested the calls like so

System.out.println(negate(negate("Hello")));

my output was like this

[-72, -101, -108, -108, -111]
[17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67, 17, 65, 67]
ACACACACAC // 5 groups of 3 characters (1 ctrl-char and "AC")

I expected the output to match exactly my input string "Hello", but instead I got this. Why? This also happens with every other input string. After nesting, each single character from the input becomes just AC.

I went further and created a method that does the same thing, but operates only on raw byte arrays

public static byte[] n(byte[] b) {
    for (int i = 0; i < b.length; i++) {
        b[i] = (byte)(~b[i] + 1);
    }
    System.out.println(Arrays.toString(b));
    return b;
}

Here the output is as expected. For

System.out.println(new String(n(n("Hello".getBytes()))));

I get

[-72, -101, -108, -108, -111]
[72, 101, 108, 108, 111]
Hello

So I guess it has to do with the way Strings are created, since it only happens when I call negate on a String that was itself built from the negated bytes?

I even walked down the class tree to look at the internal classes but I couldn't find where this behaviour comes from.

Also, in the docs of String there's the following paragraph, which might be an explanation:

The behavior of this constructor when the given bytes are not valid in the default charset is unspecified

Can anybody tell me why it's like this and what exactly is happening here?

Upvotes: 3

Views: 196

Answers (2)

T.J. Crowder

Reputation: 1074495

The issue is that you're taking the negated bytes and trying to interpret them as a valid byte stream in the default character set (remember, characters are not bytes). As the String constructor docs you quoted tell you, the result is unspecified, and probably involves error correction, dropping or replacing invalid values, and so on. Naturally, then, it's a lossy process, and reversing it will not get you back your original string.

If you get the bytes and double-negate them without converting the intermediate bytes to string, you'll get back your original result.
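By way of contrast, here is a minimal sketch of a case where the round trip through a String happens to be lossless. It assumes ISO-8859-1 is named explicitly for both encoding and decoding; that charset maps each of the 256 byte values to exactly one character, so no byte is ever invalid. The class name Latin1RoundTrip is hypothetical, and the sketch only holds for input whose characters fit in ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Same negation as in the question, but with an explicit single-byte charset
    // (an assumption for illustration, not what the original code does).
    public static String negate(String s) {
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) (~b[i] + 1);
        }
        // Every byte value is a valid ISO-8859-1 character, so nothing is replaced
        return new String(b, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(negate(negate("Hello"))); // prints "Hello"
    }
}
```

The intermediate string is still not meaningful text; it merely preserves the byte values, which is why the second negation can undo the first.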

This example demonstrates the lossy nature of new String(/*invalid bytes*/):

String s = "Hello";
byte[] b = s.getBytes();
for (int i = 0; i < b.length; i++) {
    b[i] = (byte)(~b[i] + 1);
}
// Show the negated bytes
System.out.println(Arrays.toString(b));
String s2 = new String(b);
// Show the bytes of the string constructed from them; note they're not the same
System.out.println(Arrays.toString(s2.getBytes()));

On my system, which I believe defaults to UTF-8, I get:

[-72, -101, -108, -108, -111]
[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]

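Note what happened when I took the invalid byte stream, made a string out of it, and then got the bytes of that string: each of the five invalid bytes came back as the three bytes -17, -65, -67, which is the UTF-8 encoding of the replacement character U+FFFD.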
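One way to confirm that the negated bytes really are invalid, assuming UTF-8 is the charset in play, is to decode them with a strict CharsetDecoder that reports errors instead of silently substituting. The class and method names below (StrictDecode, isValidUtf8) are hypothetical; this is a sketch rather than what new String(byte[]) does internally.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when the decoder meets a malformed or unmappable sequence
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] negated = {-72, -101, -108, -108, -111};
        System.out.println(isValidUtf8("Hello".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(negated)); // false
    }
}
```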

Upvotes: 4

Kayaman

Reputation: 73558

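You "negate" a character and its byte becomes invalid in the default charset. When that byte is decoded you get the placeholder (U+FFFD) instead. At this point everything is corrupted. Then you "negate" the placeholder, whose UTF-8 bytes are -17, -65, -67, and you get 17, 65, 67: a control character followed by your AC, once for each placeholder char.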
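The chain of steps can be traced for a single character with a short sketch. It assumes UTF-8 is passed explicitly rather than relying on the platform default, and the class name PlaceholderTrace is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PlaceholderTrace {
    // Two's-complement negation of every byte, returning a new array.
    public static byte[] negate(byte[] in) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (~in[i] + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // Step 1: 'H' is 72; negated it is -72 (0xB8), a lone continuation byte in UTF-8
        byte[] negated = negate("H".getBytes(StandardCharsets.UTF_8));
        System.out.println(Arrays.toString(negated)); // [-72]

        // Step 2: decoding replaces the malformed byte with U+FFFD
        String decoded = new String(negated, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(decoded.charAt(0))); // fffd

        // Step 3: U+FFFD encodes as the three bytes EF BF BD
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(reencoded)); // [-17, -65, -67]

        // Step 4: negating those gives a control character (17) plus 'A' (65) and 'C' (67)
        System.out.println(Arrays.toString(negate(reencoded))); // [17, 65, 67]
    }
}
```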

Upvotes: 2
