We have some Java code that, when run on a client's machine, seems to mishandle character encoding. Unfortunately, we're unable to reproduce the issue locally. Here's the code:
import java.nio.charset.StandardCharsets;

// Hex-encodes each byte as two lowercase hex digits.
private static String bytesToHex(byte[] in) {
    final StringBuilder builder = new StringBuilder();
    for (byte b : in) {
        builder.append(String.format("%02x", b));
    }
    return builder.toString();
}

private static String normalize(String str) {
    // Logs the string together with the hex of its UTF-8 encoding.
    System.out.println("Normalize " + str + " / hex " + bytesToHex(str.getBytes(StandardCharsets.UTF_8)));
    // ...
}
The hex part of the print is incorrect in the client's logs for diacritic characters, e.g. if str = "ë" we get:
In the client output it looks like the UTF-8 bytes for "ë" (c3ab) have been interpreted in another encoding such as ISO-8859-1, so the string became "Ã«", which in UTF-8 is c383c2ab.
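To make the suspected double encoding concrete, here is a minimal sketch that reproduces it in isolation. The class name `MojibakeDemo` and the helper `misdecode` are hypothetical names used only for illustration; they are not part of the code in question. The sketch assumes the bytes were decoded as ISO-8859-1 somewhere upstream, which is a guess at this point.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Hex-encodes each byte as two lowercase hex digits.
    static String bytesToHex(byte[] in) {
        StringBuilder builder = new StringBuilder();
        for (byte b : in) {
            builder.append(String.format("%02x", b & 0xff));
        }
        return builder.toString();
    }

    // Simulates a reader with the wrong charset:
    // encodes as UTF-8, then decodes those bytes as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00eb";           // "ë", written as an escape to avoid source-encoding issues
        String garbled = misdecode(original); // two chars: U+00C3 U+00AB

        // Prints c3ab
        System.out.println(bytesToHex(original.getBytes(StandardCharsets.UTF_8)));
        // Prints c383c2ab
        System.out.println(bytesToHex(garbled.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the hex in the client's log matches the second line, the string was probably already garbled before it reached normalize, rather than being corrupted by the hex conversion.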
Any idea how this could happen?
Edit: alright, solved. In fact str did contain "Ã«", but the client's log was written in ISO-8859-1 and I was reading it as UTF-8, which is why I saw "ë". As for why str contained this: it comes from a REST call, and apparently the line new InputStreamReader(conn.getInputStream()) was using UTF-8 on my system and ISO-8859-1 on others, because that constructor uses the platform default charset. So the fix is to specify UTF-8 in the constructor. A bunch of weird issues adding up.
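For anyone hitting the same thing, here is a rough sketch of the fix. It assumes the REST response body really is UTF-8. The class name `ExplicitCharsetDemo` and the helper `readAll` are made up for this example, and a `ByteArrayInputStream` stands in for `conn.getInputStream()` so the snippet runs without a network call.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Reads the whole stream as text, decoding explicitly as UTF-8
    // instead of relying on the platform default charset.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for conn.getInputStream(): the UTF-8 bytes of "ë".
        InputStream body = new ByteArrayInputStream(new byte[] {(byte) 0xc3, (byte) 0xab});
        String str = readAll(body);

        // One char, U+00EB, regardless of the machine's default charset.
        System.out.println(str.equals("\u00eb")); // prints true
    }
}
```

With the one-argument constructor, the result depends on the JVM's default charset, which varies by OS and locale on older JDKs. From JDK 18 onward the default is UTF-8, but passing the charset explicitly keeps the behavior the same everywhere.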