karlis

Reputation: 920

Java String encoding - Linux different than on Windows

I have created a Java program (a REST service). All development and testing was done on Windows; now the deployment to production-test is in progress. However, a "small" encoding problem has occurred:

String s3 = new String("grÃ¼n".getBytes(), "UTF-8");
logger.info(s3);
logger.info("das ist wirklich grün");
logger.info(new String("das ist wirklich grün".getBytes("UTF-8"), "UTF-8"));

I receive a few values through HTTP attributes (the web application is hosted on Tomcat, behind an Apache that has an auth plugin) that I have to process. They arrive encoded like you see in line 1. (The same value shows up on both Windows and Linux.)

When I convert it to UTF-8 as in line 1 and write it to the log file (log4j), I get the term "grün" (which is correct) on my Windows machine. On the Linux server, however, I still get the unconverted output.

Then I tried to use umlauts (üäö etc.) directly, as in line 2, and on both Windows and Linux the value is written correctly to the log file. I then tried a conversion as in line 3; same thing: both operating systems show the same result.

Both machines have the same locale in Java (Locale.getDefault()); I have already checked that. I am not able to change the way the value is inserted into the HTTP request!

Upvotes: 4

Views: 7176

Answers (3)

Tunde Pizzle

Reputation: 827

Please compare the JVM versions in both environments. This is most probably an issue related to the encoding.
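
For what it's worth, a quick way to compare the two environments is to print the JVM version next to the default encoding; both are standard system properties (a minimal sketch):

System.out.println(System.getProperty("java.version"));
System.out.println(System.getProperty("file.encoding"));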

Upvotes: 0

Jesper

Reputation: 206996

Something like this is invalid:

String s3 = new String("grÃ¼n".getBytes(), "UTF-8");

What happens here: you get the bytes for the string "grÃ¼n" using the default character encoding of the system that you are running this on (because you didn't specify an encoding in the call to getBytes()), and then you convert those bytes back to a String, specifying that these bytes are UTF-8 encoded text:

characters => bytes in default character encoding (which may or may not be UTF-8) => convert back to characters as if the bytes are UTF-8 encoded text

That will obviously only work correctly if the default character encoding of the system is UTF-8. On Windows it is not (it's probably Windows-1252).
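
To make the two conversions explicit, here is a minimal sketch of that round trip (class and variable names are made up for illustration):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    public static void main(String[] args) {
        String original = "das ist wirklich grün";
        byte[] bytes = original.getBytes(Charset.defaultCharset());  // what getBytes() without an argument does
        String decoded = new String(bytes, StandardCharsets.UTF_8);  // re-interpret those bytes as UTF-8
        System.out.println(Charset.defaultCharset() + ": " + decoded);
    }
}

Only when the default charset is UTF-8 does decoded come back identical to original; with Windows-1252 as the default, the single 0xFC byte produced for "ü" is not valid UTF-8 and the character gets replaced.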

Strings by themselves do not have a character encoding. There is no such thing as "a UTF-8 string" or "converting a string from X to UTF-8". A character encoding specifies how characters in a string are converted to bytes and vice versa, but it is not a property of the string itself. You can have an array of bytes which represents text encoded in a specific character encoding. (Just like "decimal" and "hexadecimal" are not properties of a number itself, only ways of presenting a number.)
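
To see that the encoding belongs to the bytes rather than to the String, compare the byte sequences the same String produces for two different charsets (a small illustration, runnable e.g. in jshell):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

String s = "grün";
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [103, 114, -61, -68, 110]
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [103, 114, -4, 110]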

Don't write your program in such a way that it depends on the default character encoding of the system it's running on; in other words, don't call getBytes() on a String without specifying the character encoding (and there are other API calls that likewise fall back to the default encoding if you don't specify one).
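
In practice that means passing the charset explicitly everywhere a default would otherwise be used, for example via the constants in java.nio.charset.StandardCharsets (value here just stands for whatever String you are handling):

import java.nio.charset.StandardCharsets;

byte[] bytes = value.getBytes(StandardCharsets.UTF_8);      // instead of value.getBytes()
String text = new String(bytes, StandardCharsets.UTF_8);    // instead of new String(bytes)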

Upvotes: 1

Stephen C

Reputation: 719561

Both machines have the same locale in Java (Locale.getDefault()); I have already checked that.

It is the default charset, not the default locale that determines what character set is used when decoding / encoding a string without a specified charset.

Check what Charset.defaultCharset().name() returns on your Windows and Linux machines. I expect that they will be different, based on the symptoms that you are reporting.
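
For example, a one-liner you can drop into either deployment (note that on most JVMs the default can also be forced at startup with -Dfile.encoding=UTF-8):

import java.nio.charset.Charset;

System.out.println(Charset.defaultCharset().name());   // typically "windows-1252" on Windows, "UTF-8" on most Linux distributions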

Upvotes: 5
