decoding and encoding strings, ISO-8859-1 to UTF-8 in Java

Question

I have read the other posts on this issue, but the solutions they presented did not work for me. Even the approach in the official Java documentation did not work as I expected (I am using Java 11): https://docs.oracle.com/javase/tutorial/i18n/text/string.html

My problem is that I am reading one byte at a time from a byte buffer, putting each byte into a byte array, and making a String out of that byte array. The bytes come from an embedded system that can only send ISO-8859-1 bytes, so I end up with a byte array of ISO-8859-1 bytes, and the Java String I get is thus ISO-8859-1 encoded. No problem here. The String in IntelliJ looks like this:

The bytes I am trying to convert from ISO-8859-1 to UTF-8 are the ones in yellow. I want them to be UTF-8, so in the end the "C9" byte should be replaced by the "C3A9" bytes.

The first step works correctly. I call maintenanceResponseString.getBytes(StandardCharsets.UTF_8) and I get exactly the bytes I want, the UTF-8 encoding of the string:

The problem comes when I try to make a String out of these new (and GOOD) bytes, like this:

new String(maintenanceResponseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)

The old bytes are back?! It's as if getBytes(UTF-8) never happened. That is NOT what the documentation says should happen... What am I missing here? I have done tests and the string really is still ISO-8859-1 encoded... I don't know what is going on. Where are the bytes from getBytes?

How do you convert a String that contains ISO-8859-1 bytes to UTF-8 bytes? I'm out of alternatives and I badly need to get this done for a professional project... this should be easy!

Note: I have also tried alternatives such as

ByteBuffer buffer = StandardCharsets.UTF_8.encode(s);
return StandardCharsets.UTF_8.decode(buffer).toString();

But the exact same thing happens.

Thank you in advance for your help.

EDIT: The comments pointed out that Strings in Java 9+ are no longer always represented internally as UTF-16, but may also be stored as Latin-1 (why...). I think that is what made me believe the Strings were "internally encoded in Latin-1", when it is just the default representation of the String if we don't specify the encoding we want to use when displaying it.

From what I understand now, the String itself is not bound to any encoding, and you can CHOOSE the encoding it is written in at the time it gets written. My actual issue is that the String ends up written to an XML file via JAXB marshalling in Latin-1, and I now think the problem lies over there... I will dig further when I have access to my work computer again and report back here.

Genku · Accepted Answer

It turns out there was nothing wrong with Strings and "their encoding". What happened is that I got really confused because the debugger shows the contents of the String in its "default internal storage encoding", which here was ISO-8859-1 (but can be UTF-16, depending on the content of the String).

Quote from JEP 254:

We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

But the internal storage encoding doesn't actually matter. When the String is written out, it is encoded with whatever charset you specify at that point.
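A minimal sketch of that point, using only the JDK. The class name `EncodingAtTheEdges`, its helper methods, and the sample byte are hypothetical, chosen for illustration: in ISO-8859-1 the byte E9 is 'é', which UTF-8 encodes as C3 A9 (the byte C9 is 'É', which becomes C3 89). The bytes are decoded once with the charset the device uses, and the resulting String is then encoded with whichever charset the output needs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingAtTheEdges {

    // Decode raw device bytes, stating explicitly that they are ISO-8859-1.
    public static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Format bytes as uppercase hex, e.g. {0xC3, 0xA9} -> "C3A9".
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encoding and then decoding with the same charset gives back an equal
    // String (for text that charset can represent), which is why the round
    // trip in the question appears to do nothing.
    public static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        byte[] fromDevice = {(byte) 0xE9};  // hypothetical sample: 'é' in ISO-8859-1
        String text = decodeLatin1(fromDevice);

        // Same String, two different byte sequences depending on the charset.
        System.out.println(toHex(text.getBytes(StandardCharsets.ISO_8859_1)));    // E9
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));         // C3A9
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The only places a charset appears are the two edges: decoding the incoming bytes and encoding the outgoing ones. The String in between has no encoding to "convert".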

My actual issue was in sending the String in an HTTP request with Spring RestTemplate. I didn't have a header specifying the "charset" to use in the request, and RestTemplate defaults to ISO-8859-1 if not told otherwise. I added charset=utf-8, and the String was correctly written as UTF-8 in the request.
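The same idea, declaring the charset explicitly and encoding the body with it, can be sketched without Spring. This uses the JDK's own `java.net.http` API (Java 11+) rather than RestTemplate, so it illustrates the principle and is not the code from my project; the URL, the class name, and the method name are placeholders.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class Utf8Request {

    // Build a POST whose body bytes and Content-Type header agree on UTF-8.
    public static HttpRequest buildUtf8Post(String url, String body) {
        return HttpRequest.newBuilder(URI.create(url))
                // Tell the receiver which charset the body bytes are in.
                .header("Content-Type", "text/plain; charset=utf-8")
                // Encode the String as UTF-8 at the moment it is written out.
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL; the request is only built here, never sent.
        HttpRequest request = buildUtf8Post("http://localhost:8080/endpoint", "\u00E9");
        System.out.println(request.headers().firstValue("Content-Type").orElse("(none)"));
        System.out.println(request.bodyPublisher().get().contentLength());
    }
}
```

Whatever the client library, the header and the bytes actually written have to name the same charset; a mismatch between the two is what produces mangled accented characters on the receiving side.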

Thank you to @VGR @Eugene @skomisa for the help
