Reputation: 3890
I am trying to encode/decode a byte array to a String, but the input and output do not match. Am I doing something wrong?
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The output is:
130021000061f8f0001a
130021000061efbfbd
Complete code:
String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};
byte[] by = new byte[arr.length];
for (int i = 0; i < arr.length; i++) {
by[i] = (byte)(Integer.parseInt(arr[i],16) & 0xff);
}
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
Upvotes: 3
Views: 234
Reputation: 6287
When you build the String from the array of bytes, the bytes are decoded.
Since the bytes in your code do not represent valid characters, the bytes that finally make up the String are not the same as the ones you passed as a parameter.
From the Javadoc of the String(byte[]) constructor:
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array. The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
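As a rough illustration of the CharsetDecoder route mentioned in the quote, the sketch below decodes strictly and fails on malformed input instead of silently substituting characters. The class and method names (StrictDecode, decodeStrict) are made up for this example; the byte values are the ones from the question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Hypothetical helper: decodes bytes as UTF-8 and throws on invalid input
    // rather than replacing it with U+FFFD.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // The bytes from the question: 13 00 21 00 00 61 F8 F0 00 1A
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                     (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
        try {
            System.out.println(decodeStrict(by));
        } catch (CharacterCodingException e) {
            // Reached for this input, because F8 cannot start a valid UTF-8 sequence.
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```

Failing loudly like this makes it obvious that the data was never text in the first place, which is usually the more useful signal.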
Upvotes: 2
Reputation: 49724
The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.
First of all, the f8 lead byte would denote a 5-byte sequence (a form that modern UTF-8, which stops at 4 bytes, no longer permits at all), and you've only got four bytes. Secondly, a lead byte can only be followed by continuation bytes of the form 8x, 9x, ax or bx, and f0 is not one of them.
Therefore it gets replaced with a Unicode replacement character (U+FFFD), whose byte sequence in UTF-8 is efbfbd.
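The two facts used above can be checked with a small sketch. The class and helper names (ReplacementDemo, isContinuationByte, toHex) are invented for this illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx, i.e. 0x80..0xBF.
    public static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Hypothetical helper: lower-case hex dump of a byte array.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // f0 is 11110000 in binary, so it cannot follow a lead byte.
        System.out.println(isContinuationByte((byte) 0xF0)); // false
        // 9a is 10011010 in binary, a valid continuation byte.
        System.out.println(isContinuationByte((byte) 0x9A)); // true
        // The replacement character U+FFFD encoded as UTF-8.
        System.out.println(toHex("\uFFFD".getBytes(StandardCharsets.UTF_8))); // efbfbd
    }
}
```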
And there (rightly) is no guarantee that converting an invalid byte sequence to a string and back will produce the same byte sequence. (Note that even two seemingly identical strings may be represented by different bytes in Unicode; see Unicode equivalence.)
The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
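If the bytes do have to travel as text (in a log line or a JSON field, say), one option is an encoding designed for arbitrary binary data. The sketch below uses java.util.Base64 (Java 8+), which is a swapped-in suggestion rather than something the question used; the Hex class from the question would serve the same purpose. The class name LosslessBytes and its methods are made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class LosslessBytes {
    // Hypothetical helpers: turn arbitrary bytes into text and back without loss.
    public static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // The bytes from the question: 13 00 21 00 00 61 F8 F0 00 1A
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                     (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};

        // Base64 round trip: every byte value survives.
        String text = encode(by);
        System.out.println(Arrays.equals(by, decode(text))); // true

        // UTF-8 round trip: lossy, because F8 is not valid UTF-8
        // and the encoder never emits it.
        byte[] viaUtf8 = new String(by, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(by, viaUtf8)); // false
    }
}
```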
Upvotes: 5
Reputation: 4196
My UTF-8 is a bit rusty :-), but the sequence F8 F0 is, IMHO, not a valid UTF-8 encoding.
See http://en.wikipedia.org/wiki/Utf-8#Description.
Upvotes: 3