parsecer

Reputation: 5176

Converting string with getBytes between two encodings

I have the following code:

String s0="Përshëndetje botë!";
byte[] b1=s0.getBytes("UTF8");
byte[] b2=s0.getBytes("ISO8859_1");
String s0_utf8=new String(b1, "UTF8");  //right encoding, wrong characters
//String s0_utf8=new String(b1, "ISO8859_1"); //wrong encoding, wrong characters
String s0_iso=new String(b2, "UTF8");  //wrong encoding; outputs right characters
//String s0_iso=new String(b2, "ISO-8859-1");  //right encoding; if uncommented, outputs damaged characters
System.out.println("s0_utf8: "+s0_utf8);  //
System.out.println("s0_iso: "+s0_iso);

So, the string "Përshëndetje botë!" is converted into bytes using UTF8 and ISO-8859-1, then those bytes are converted back to Unicode characters using corresponding encodings. The right characters are displayed only in one case here: if we encoded the original string into bytes using ISO8859_1 and decoded it using UTF-8. All other cases result in wrong characters.

String s0="P\u00ebrsh\u00ebndetje bot\u00eb!";
byte[] b1=s0.getBytes("UTF8");
byte[] b2=s0.getBytes("ISO8859_1");
String s0_utf8=new String(b1, "UTF8"); //right encoding; outputs right characters
//String s0_utf8=new String(b1, "ISO8859_1"); //wrong encoding, wrong characters
String s0_iso=new String(b2, "UTF8");  //wrong encoding; outputs wrong characters
//String s0_iso=new String(b2, "ISO-8859-1");  //right encoding; if uncommented, outputs damaged characters
System.out.println("s0_utf8: "+s0_utf8);  //
System.out.println("s0_iso: "+s0_iso);

Here there are two cases when the right words are displayed: when the string is both encoded and decoded using the same encoding.

I don't understand what's going on here. How is that possible? What difference does Unicode's representation of characters make? Why does the pair "encode with ISO, decode with UTF-8" work? Shouldn't the resulting string be completely different from the original, since the ISO bytes might be interpreted differently by UTF-8?

Upvotes: 0

Views: 721

Answers (2)

parsecer

Reputation: 5176

This answer really helped me to understand what's going on.

In the first case:

String s0="Përshëndetje botë!";

s0 is in ISO8859_1;

b1: get bytes in UTF-8,

b2: get bytes in ISO8859_1.

IDEA converts the ë characters wrongly => Përshëndetje botë!

String s0_iso=new String(b2, "UTF8"); converts the string into the IDEA's encoding and it gets printed correctly.

String s0_iso=new String(b2, "ISO-8859-1"); doesn't change the original string => Përshëndetje botë!

When the string gets converted into a foreign encoding (UTF-8), the trouble starts:

String d=new String(b1, "UTF8"); => Përshëndetje botë!

String b=new String(b1, "ISO8859_1");=> Përshëndetje botë!

I'm still not entirely sure what's going on in these two cases but

d.equals("Përshëndetje botë!") is true.

My guess is that when the file is compiled as UTF-8 (the default), the compiler interprets the characters in s0 as if they were already in UTF-8 and no real conversion happens; the characters turn out damaged because there is nothing like them in UTF-8. During the construction of the d string literally the same thing happens, but in the code itself: the characters are handled as if they were already in UTF-8 and then pushed into a String in the same UTF-8. But they should have been decoded from ISO8859_1 first and only then encoded into UTF-8, which is why the output turns out wrong.

In the second case:

String s0="P\u00ebrsh\u00ebndetje bot\u00eb!";

the original string is already fully in UTF-8. Therefore there will be fewer problems with displaying it.

String d = new String(b1, "UTF8") doesn't change the original string; d.equals(s0) is true => Përshëndetje botë!

String p =new String(b1, "ISO8859_1") converts the original UTF-8 string into ISO8859_1 => Përshëndetje botë!

p.equals("Përshëndetje botë!") is true.

Not sure what's going on here though and why the last one gets all characters correctly:

String s0_iso=new String(b2, "UTF8") => P�rsh�ndetje bot�

String s0_iso=new String(b2, "ISO-8859-1") => Përshëndetje botë!
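A possible explanation for the replacement characters, as a minimal sketch: ISO-8859-1 stores ë as the single byte EB, while UTF-8 stores it as the two bytes C3 AB. To a UTF-8 decoder, a lone EB looks like the start of a multi-byte sequence, and when an ASCII letter follows instead of a continuation byte, the decoder substitutes U+FFFD. The class name IsoBytesAsUtf8 and the helper hex are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: inspects the bytes behind "ë" in both encodings.
class IsoBytesAsUtf8 {

    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    // Encodes with ISO-8859-1, then decodes the same bytes as UTF-8.
    static String isoBytesDecodedAsUtf8(String s) {
        byte[] iso = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(iso, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String e = "\u00eb"; // ë, written as an escape so the source encoding cannot matter

        System.out.println(hex(e.getBytes(StandardCharsets.ISO_8859_1))); // EB
        System.out.println(hex(e.getBytes(StandardCharsets.UTF_8)));      // C3 AB

        // "ën" as ISO-8859-1 bytes is EB 6E, which is not valid UTF-8.
        String decoded = isoBytesDecodedAsUtf8("\u00ebn");
        System.out.println(decoded.charAt(0) == '\uFFFD'); // true
    }
}
```

Decoding the same ISO-8859-1 bytes with ISO-8859-1 simply maps EB back to U+00EB, which is why the matching pair displays correctly in the second case.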

Upvotes: 0

JB Nizet

Reputation: 692121

My guess is that the strings are wrong from the start, because your Java source file is encoded in encoding A, and the compiler reads it with encoding B. That explains why the problem doesn't happen when you use escape sequences rather than accents.
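To illustrate that guess, here is a rough sketch that simulates the mismatch at runtime instead of at compile time. It assumes encoding A is UTF-8 and encoding B is ISO-8859-1, which is only one possible combination; the class name SourceEncodingMismatch and the method misread are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: mimics a compiler reading UTF-8 source bytes as ISO-8859-1.
class SourceEncodingMismatch {

    // Returns what the literal would become if its UTF-8 bytes were read as ISO-8859-1.
    static String misread(String intended) {
        byte[] onDisk = intended.getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String intended = "P\u00ebrsh\u00ebndetje bot\u00eb!";
        String s0 = misread(intended); // the string is already damaged at this point

        // Each ë became two characters, so the damaged string is 3 chars longer.
        System.out.println(s0.length() - intended.length()); // 3

        // Matching encodings round-trip the damaged string faithfully: still wrong.
        byte[] b1 = s0.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(b1, StandardCharsets.UTF_8).equals(s0)); // true

        // The "wrong" pair happens to reverse the original misreading.
        byte[] b2 = s0.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(b2, StandardCharsets.UTF_8).equals(intended)); // true
    }
}
```

Under these assumptions, encoding with ISO-8859-1 and decoding with UTF-8 is not a correct conversion; it just undoes the earlier mistake. With escape sequences the literal never depends on the file encoding, so only the matching pairs give the right text.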

Regarding

//String s0_iso=new String(b2, "ISO-5589-1");  //right encoding; if uncommented, outputs damaged characters

no, it's not the right encoding. 5589 != 8859.
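One way to rule out misspelled charset names, sketched here as a suggestion rather than a required fix, is to use the constants in java.nio.charset.StandardCharsets, or to check a name with Charset.isSupported before relying on it. The class name CharsetNames and the method isKnown are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: checks charset names instead of trusting string literals.
class CharsetNames {

    // True if this JVM recognizes the given charset name or alias.
    static boolean isKnown(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        System.out.println(isKnown("ISO-8859-1")); // true
        System.out.println(isKnown("ISO-5589-1")); // false, no such charset

        // Constants cannot be misspelled without a compile error,
        // and these overloads do not throw UnsupportedEncodingException.
        String s = "P\u00ebrsh\u00ebndetje bot\u00eb!";
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1).equals(s)); // true
    }
}
```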

Upvotes: 2
