amit prinz

Reputation: 33

Java String->unicode->String transformation inconsistency

I'm trying to serialize a Java string into an array of bytes and then deserialize the array into a string again. It seemed to work OK until I tested it with the Unicode character \ude4e. For some reason, the original string "\ud34e" is not equal to the deserialized string.

This is the serialization code (where encoding = Charset.forName( "UTF-16BE" ) and str = "\ud34e")

ByteArrayOutputStream out = new ByteArrayOutputStream();
Writer temp = new OutputStreamWriter( out, encoding );
temp.write( str );
temp.close();
byte[] bytes = out.toByteArray();

String deserialized = new String( bytes, encoding );

So what am I doing wrong? Thanks!

Upvotes: 3

Views: 1028

Answers (4)

irreputable

Reputation: 45453

Although it's not a valid char, @Ant shows that an encode-decode round trip returns the original. This is probably because UTF-16 is a very simple and direct encoding that coincides with Java's 16-bit char representation.

If we experiment with UTF-8 instead, a strict encoder should report an error, since UTF-8 has no way to encode half of a surrogate pair.
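The UTF-8 experiment can be sketched as follows. This is a minimal example, not the answerer's code; it assumes Java 7+ for StandardCharsets, and the class and method names are made up for illustration. It suggests that whether an error surfaces depends on the API: String.getBytes substitutes a replacement byte rather than throwing, while a CharsetEncoder set to CodingErrorAction.REPORT rejects the lone surrogate.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8LoneSurrogate {

    /** Lenient path: String.getBytes replaces unencodable input instead of throwing. */
    static String lenientRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Strict path: returns true only if an encoder set to REPORT accepts the input. */
    static boolean strictEncodeSucceeds(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String lone = "\ude4e"; // a low surrogate with no preceding high surrogate
        System.out.println("lenient round trip equal: " + lone.equals(lenientRoundTrip(lone)));
        System.out.println("strict encode succeeds:   " + strictEncodeSucceeds(lone));
    }
}
```

Both lines should report false for the lone surrogate: the lenient path quietly changes the data, and the strict path refuses it.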

Upvotes: 0

Jesper

Reputation: 206996

When I look up the code DE4E in the online Unicode charts, it falls in the Low Surrogates chart. It's not a character by itself, but a special code that's used in UTF-16 (according to the documentation there).
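The same lookup can be done from Java itself. The sketch below is only an illustration (the class and method names are invented); it classifies the two code units that appear in the question, \ude4e and \ud34e, using the standard Character methods, and then shows that a low surrogate becomes meaningful once it follows a high surrogate.

```java
// Hypothetical helper class, for illustration only.
final class SurrogateCheck {

    /** Describes how a single UTF-16 code unit is classified. */
    static String classify(char c) {
        if (Character.isHighSurrogate(c)) return "high surrogate";
        if (Character.isLowSurrogate(c)) return "low surrogate";
        return "ordinary BMP character";
    }

    public static void main(String[] args) {
        System.out.println("\\ude4e: " + classify('\ude4e')); // low surrogate
        System.out.println("\\ud34e: " + classify('\ud34e')); // ordinary BMP character

        // Paired with a high surrogate, \ude4e forms a single supplementary code point.
        String pair = "\ud83d\ude4e";
        System.out.println("code points in pair: " + pair.codePointCount(0, pair.length())); // 1
        System.out.println("code point: U+" + Integer.toHexString(pair.codePointAt(0)).toUpperCase()); // U+1F64E
    }
}
```

Note that only \ude4e is a surrogate; \ud34e is a regular character, so the two would be expected to behave differently when encoded on their own.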

Unicode is not as simple as one character mapping to one code point: there are many quirks, and different code points or byte sequences may represent the same character.

It's quite possible that some code points, when serialized and deserialized, come back as a different but equivalent code point.

Upvotes: 3

bmargulies

Reputation: 100186

DE4E is one half of a surrogate pair. By itself, it's invalid. It will be converted to ? or discarded by the OutputStreamWriter. If you use the java.nio classes, you can see the errors.
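One way to see the error with java.nio is to hand the OutputStreamWriter a CharsetEncoder configured to report problems instead of replacing them. The following is a rough sketch under that assumption, not necessarily what the answer had in mind; the class and method names are hypothetical, and UncheckedIOException needs Java 8+.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
final class StrictWriterDemo {
    static final Charset ENCODING = Charset.forName("UTF-16BE");

    /** Mirrors the question's code: a writer built from a Charset replaces bad input silently. */
    static String defaultRoundTrip(String s) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(out, ENCODING);
            w.write(s);
            w.close();
            return new String(out.toByteArray(), ENCODING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same write through an encoder that reports errors; returns the exception, or null if none. */
    static CharacterCodingException strictWriteError(String s) {
        CharsetEncoder strict = ENCODING.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        Writer w = new OutputStreamWriter(new ByteArrayOutputStream(), strict);
        try {
            w.write(s);
            w.close();
            return null;
        } catch (CharacterCodingException e) {
            return e; // e.g. a MalformedInputException for an unpaired surrogate
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String lone = "\ude4e";
        System.out.println("default round trip equal: " + lone.equals(defaultRoundTrip(lone)));
        System.out.println("strict writer error: " + strictWriteError(lone));
    }
}
```

With the strict encoder the failure shows up at write time as an exception, so the bad input can no longer slip through as a replaced character.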

Upvotes: 6

Ant Kutschera

Reputation: 6588

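import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        Charset encoding = Charset.forName( "UTF-16BE" );
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer temp = new OutputStreamWriter( out, encoding );
        String str = "\ud34e";

        // Serialize the string to bytes
        temp.write( str );
        temp.close();
        byte[] bytes = out.toByteArray();

        // Deserialize and compare with the original
        String deserialized = new String( bytes, encoding );
        System.out.println("'" + str + "' / '" + deserialized + "' / " + (str.equals(deserialized)));
    }
}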

For me, the output is:

'?' / '?' / true

i.e. they ARE equal...

I'm using:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Upvotes: 1
