Reputation: 33
I'm trying to serialize a Java string into an array of bytes and then deserialize the array into a string again. It seemed to work OK until I tested it with the Unicode character \ude4e. For some reason, the original string "\ud34e" is not equal to the deserialized string.
This is the serialization code (where encoding = Charset.forName( "UTF-16BE" ) and str = "\ud34e"):
ByteArrayOutputStream out = new ByteArrayOutputStream();
Writer temp = new OutputStreamWriter( out, encoding );
temp.write( str );
temp.close();
byte[] bytes = out.toByteArray();
String deserialized = new String( bytes, encoding );
So what am I doing wrong? Thanks!
Upvotes: 3
Views: 1028
Reputation: 45453
Although it's not a valid char, @Ant shows that encoding and then decoding returns the original. This is probably because UTF-16 is a very simple and direct encoding that coincides with Java's 16-bit char representation.
If we experiment with UTF-8 instead, the encoding should throw a fatal error. There is no way for UTF-8 to encode half of a surrogate pair.
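Whether an error is actually raised seems to depend on the API used. As a minimal sketch (the class and method names here are made up for illustration), the lenient String API does not throw: as far as I know, String.getBytes replaces malformed input with the charset's replacement bytes, which for UTF-8 is a single '?'. The round trip is therefore lossy rather than fatal:

```java
import java.nio.charset.Charset;

public class Utf8LoneSurrogate {
    // Encodes to UTF-8 and decodes again using the lenient String API.
    // Malformed input is replaced rather than reported.
    static String roundTrip(String s) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] bytes = s.getBytes(utf8);
        return new String(bytes, utf8);
    }

    public static void main(String[] args) {
        String lone = "\ude4e";        // unpaired low surrogate
        String pair = "\ud83d\ude4e";  // complete surrogate pair (U+1F64E)
        System.out.println(lone.equals(roundTrip(lone))); // false
        System.out.println(pair.equals(roundTrip(pair))); // true
    }
}
```

To get an exception instead of a silent replacement, a CharsetEncoder configured with CodingErrorAction.REPORT would be needed.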
Upvotes: 0
Reputation: 206996
When I look up the code de4e on the online Unicode charts, it says this code is in the Low Surrogates chart. It's not a character by itself, but a special code that's used in UTF-16 (according to the documentation there).
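The same classification can be checked from Java itself. This is a small sketch (the class and the describe helper are hypothetical names) using the standard Character methods, and it also shows how the low surrogate combines with a matching high surrogate into a single code point:

```java
public class SurrogateInspect {
    // Classifies a single UTF-16 code unit.
    static String describe(char c) {
        if (Character.isHighSurrogate(c)) return "high surrogate";
        if (Character.isLowSurrogate(c)) return "low surrogate";
        return "ordinary BMP char";
    }

    public static void main(String[] args) {
        System.out.println(describe('\ude4e')); // low surrogate
        System.out.println(describe('\ud34e')); // ordinary BMP char

        // A high surrogate followed by a low surrogate forms one code point.
        int cp = Character.toCodePoint('\ud83d', '\ude4e');
        System.out.println(Integer.toHexString(cp)); // 1f64e
    }
}
```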
Unicode is not as simple as a single character mapping to a single code point: there are lots of quirks, and different code points and sequences of bytes might refer to the same character.
It's quite possible that some code points, when serialized and deserialized, result in a different but equivalent code point.
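As an illustration of equivalent sequences, here is a sketch using java.text.Normalizer (the class and the canonicallyEqual helper are invented for this example). Note the assumption: to my knowledge a plain charset encode/decode does not normalize anything, so this shows that equivalent sequences exist, not that a round trip produces them:

```java
import java.text.Normalizer;

public class EquivalentSequences {
    // Compares two strings after normalizing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // e with acute, one code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```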
Upvotes: 3
Reputation: 100186
DE4E is one half of a surrogate pair. By itself, it's invalid. It will be converted to ? or discarded by the OutputStreamWriter. If you use the java.nio classes, you can see the errors.
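A minimal sketch of the java.nio approach, assuming a CharsetEncoder set to report malformed input instead of replacing it (the class and the encodesCleanly helper are hypothetical names):

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class StrictEncodeDemo {
    // Returns true if the string encodes without any replacement or error.
    static boolean encodesCleanly(String s, Charset cs) {
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            // An unpaired surrogate is reported here as malformed input.
            return false;
        }
    }

    public static void main(String[] args) {
        Charset utf16be = Charset.forName("UTF-16BE");
        System.out.println(encodesCleanly("\ude4e", utf16be));       // false
        System.out.println(encodesCleanly("\ud83d\ude4e", utf16be)); // true
    }
}
```

Running such a check before serializing is one way to detect strings that would not survive the round trip.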
Upvotes: 6
Reputation: 6588
public static void main(String[] args) throws IOException {
    Charset encoding = Charset.forName( "UTF-16BE" );
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Writer temp = new OutputStreamWriter( out, encoding );
    String str = "\ud34e";
    temp.write( str );
    temp.close();
    byte[] bytes = out.toByteArray();
    String deserialized = new String( bytes, encoding );
    System.out.println("'" + str + "' / '" + deserialized + "' / " + (str.equals(deserialized)));
}
For me, the output is:
'?' / '?' / true
i.e. they ARE equal...
I'm using: java version "1.6.0_24" Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
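A '?' on the console may only mean that the terminal cannot render the character, so the printed strings alone do not tell much. A small sketch (the class and the toHex helper are names invented here) that dumps the UTF-16 code units in hex makes it easier to see what each string really contains:

```java
public class CodeUnitDump {
    // Renders each UTF-16 code unit of the string as four hex digits.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\ud34e"));  // d34e
        System.out.println(toHex("a\ude4e")); // 0061 de4e
    }
}
```

Printing toHex(str) and toHex(deserialized) side by side shows any difference between the two strings without relying on the console's rendering.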
Upvotes: 1