R E

Reputation: 1

Java Character encoding, ISO to UTF conversion

This subject has been targeted in many discussions and yet we still see new ones showing up. My scenario is as follows:

A Java framework runs on a Linux server where UTF-8 is the JVM's default character encoding. The framework consists of services that receive Tibco RV messages for processing. Some of these messages contain non-ASCII characters; they are sent from a Windows server, and ISO-8859-1 is the encoding used when the messages are created. When the data is extracted from the Tibco RV message, the problematic fields arrive as Java Objects and need to be converted to Strings. So far I haven't been able to turn the ISO-8859-1 strings containing non-ASCII characters (Swedish "å", "ä", "ö") into proper UTF-8 Strings. I have tried the following:

String isoStreet = new String(response.get("street").toString().getBytes(StandardCharsets.ISO_8859_1),java.nio.charset.StandardCharsets.UTF_8);

I've also tried the encoders/decoders in the java.nio package, with no success.

What's also interesting is that I use PuTTY to connect to the server where the services are hosted and running. From there I can make a direct Tibco RV request from the shell (using the tibcorvsend client). It seems I need to set the remote character set to ISO-8859-1 in PuTTY (Window > Translation) before signing in to the server and making that Tibco RV request. When this is done, the non-ASCII characters are shown correctly in the response, no matter which encoding I set on the remote Linux server. Using 'export LC_ALL=en_US.UTF-8' or 'export LC_ALL=sv_SE.iso88591' makes no difference in this case; only the remote encoding I set in PuTTY matters.

This should imply that the response message itself is OK, since at least the shell is able to output the proper characters. Inside the Java VM (using the Java services), however, I guess the response fields are quietly converted to Strings when I debug and inspect the response Object in the Watch view, a conversion I don't want. I'm not sure this is easy to follow; if not, I can try to explain it more clearly.

Any input on this problem, anyone?

Regards /R

Upvotes: 0

Views: 2388

Answers (1)

Jesper

Reputation: 206776

A character encoding specifies how text, which consists of characters, is translated to bytes and vice versa. As you know there are different character encodings, such as ASCII, ISO-8859-1 and UTF-8.

A string consists of characters. At some point you want to convert these characters into bytes, so that you can send them over a network, store them in a file or whatever you want to do. You use a character encoding to translate the string into bytes. And at the other side, where you receive the bytes, you use the same character encoding to translate the bytes back into characters in a string.
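A minimal sketch of that round trip, assuming both sides agree on ISO-8859-1. The class and method names are made up for illustration, and the street name is just sample data:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Sender side: turn characters into bytes using an explicit encoding.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Receiver side: turn the bytes back into characters with the SAME encoding.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Smågatan", written with a Unicode escape so the source file's own encoding does not matter.
        String original = "Sm\u00e5gatan";
        byte[] wire = encode(original);
        System.out.println(Arrays.toString(wire));
        System.out.println(decode(wire).equals(original));
    }
}
```

As long as the encoding used for decoding matches the one used for encoding, the receiver gets back the same characters the sender started with.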

Let's look at why a line like the one you posted is incorrect. Let's first rewrite it so that I can explain the parts:

String street = response.get("street").toString();
byte[] streetBytes = street.getBytes(StandardCharsets.ISO_8859_1);
String isoStreet = new String(streetBytes, StandardCharsets.UTF_8);

In the first line, you get some data from the response and convert it to a string. (What does response.get("street") return?)

In the second line you encode that string using the ISO-8859-1 character set. You get a byte array that contains valid ISO-8859-1 character codes for the characters in the string.

In the third line, you convert the bytes to a string, and you pretend that the bytes are UTF-8 bytes. That is wrong, because the bytes are ISO-8859-1 data and not UTF-8 data. When you do this, you get wrong characters: any byte sequence that is not valid UTF-8 is silently replaced with the replacement character U+FFFD, because this String constructor does not throw on malformed input. (A CharsetDecoder configured with CodingErrorAction.REPORT would throw an exception instead.)
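To make the effect concrete, here is a small sketch that applies the same encode-then-misdecode step to a sample string. The class and method names are invented for this example. It relies on the documented behaviour of the String(byte[], Charset) constructor, which replaces malformed input instead of throwing:

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Reproduces the pattern from the question: encode as ISO-8859-1, decode as UTF-8.
    static String brokenConversion(String s) {
        byte[] isoBytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(isoBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String swedish = "\u00e5\u00e4\u00f6"; // "åäö"
        String result = brokenConversion(swedish);

        // The single bytes 0xE5, 0xE4, 0xF6 are not valid UTF-8 on their own,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(result.contains("\uFFFD"));

        // Pure ASCII text survives, because ASCII bytes mean the same in both encodings.
        System.out.println(brokenConversion("Storgatan"));
    }
}
```

This is why the problem tends to show up only for characters such as "å", "ä" and "ö": everything in the ASCII range looks fine, which can hide the bug for a long time.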

One thing to be aware of is that a string just consists of characters. A string does not have an encoding by itself. You use a character encoding to translate a string to bytes and vice versa. You cannot "change the character encoding of a string" because the character encoding is simply not a property of the string. Just like a number is not intrinsically decimal or hexadecimal - those are just different ways to represent the same number.
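The following sketch illustrates the point: one and the same string produces different bytes depending on the encoding you pick at the moment you convert it. The hex helper and class name are only for this example:

```java
import java.nio.charset.StandardCharsets;

public class SameStringDifferentBytes {
    // Formats a byte array as space-separated upper-case hex, e.g. "C3 A5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u00e5"; // the single character "å"

        // Same string, two different byte representations.
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // prints "E5"
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // prints "C3 A5"
    }
}
```

The string itself never changes; only the byte representation chosen at the boundary does.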

What you have to do is:

  • At the point where you write the message, make sure you use the right character encoding to convert strings to bytes.

  • At the point where you read the message, make sure you use the right character encoding to convert bytes to strings.

Do not read something into a string using the default character encoding of the platform and then try to "convert the string". That does not work.
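As a rough sketch of decoding at the point of reading: the helper below assumes, purely for illustration, that the field may arrive as a raw byte[] holding ISO-8859-1 data. What response.get("street") actually returns is not known here, and fieldToString is a hypothetical helper, not part of any Tibco API:

```java
import java.nio.charset.StandardCharsets;

public class FieldDecoder {
    // Hypothetical helper: decodes a message field into a String.
    // Assumption: raw bytes in the message are ISO-8859-1, as stated for the sender.
    static String fieldToString(Object field) {
        if (field instanceof byte[]) {
            // Decode with the encoding the sender used, not the platform default.
            return new String((byte[]) field, StandardCharsets.ISO_8859_1);
        }
        // Already a String (or something else): the bytes were decoded earlier,
        // so the charset has to be fixed at that earlier point instead.
        return String.valueOf(field);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xE5, (byte) 0xE4, (byte) 0xF6 }; // "åäö" in ISO-8859-1
        System.out.println(fieldToString(raw).equals("\u00e5\u00e4\u00f6"));
    }
}
```

If the field is already a String by the time your code sees it, the decoding has happened somewhere upstream, and that is the place where the character encoding needs to be configured.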

Upvotes: 1
