jules

Reputation: 1918

How to "fix" broken Java Strings (charset-conversion)

I'm running a Servlet that takes POST requests from websites that aren't necessarily encoded in UTF-8. These requests get parsed with GSON, and the information (mainly strings) ends up in objects.

Client side charset doesn't seem to be used for any of this, as Java just stores Strings in Unicode internally.

Now, if the page sending a request uses a non-Unicode charset, the strings come out garbled and no longer represent what was sent. The data seems to be misinterpreted somewhere, either when the servlet turns the request into a String or when GSON parses it.

Assuming there is no easy way of fixing the root of the issue, is there a way of recovering that information, given the (misinterpreted) Java Strings and the charset identifier (e.g. "Shift_JIS", "Windows-1255") used to display the page on the client's side?

Upvotes: 2

Views: 3298

Answers (4)

Ryan Stewart

Reputation: 128849

The correct way to fix the problem is to ensure that when you read the content, you do so using the correct character encoding. Most frameworks and libraries will take care of this for you, but if you're manually writing servlets, it's something you need to be aware of. This isn't a shortcoming of Java. You just need to pay attention to the encodings. Specifically, the Content-Type header should contain useful information.

Any time you convert from a byte stream to a character stream in Java, you should supply a character encoding so that the bytes can be properly decoded into characters. See for example the InputStreamReader constructors.
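As a minimal sketch of that advice, the helper below decodes a byte stream with an explicitly supplied charset. The class and method names (BodyReader, readAll) are made up for illustration, and in a real servlet the charset would have to come from the request, for example from the Content-Type header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of any servlet API.
final class BodyReader {
    // Decodes the whole stream with the given charset instead of
    // relying on the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Something like BodyReader.readAll(request.getInputStream(), Charset.forName("Shift_JIS")) would then hand GSON a correctly decoded String, assuming the client really did send Shift_JIS.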

Upvotes: 0

BalusC

Reputation: 1108802

Assuming that it's obtained as a POST request parameter the following way

String string = request.getParameter("name");

then you need to URL-encode the string back to the original query string parameter value using the charset which the server itself was using to decode the parameter value

String original = URLEncoder.encode(string, "UTF-8");

and then URL-decode it using the intended charset

String fixed = URLDecoder.decode(original, "Shift_JIS");
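The two steps above can be combined into one small helper. This is only a sketch: the class and method names are hypothetical, and the usage note assumes the server decoded the parameter as ISO-8859-1, which may not match your container's configuration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper wrapping the encode/decode round trip.
final class ParameterRepair {
    // serverCharset: the charset the server used to decode the parameter.
    // clientCharset: the charset the client actually used to encode it.
    static String redecode(String misdecoded, String serverCharset, String clientCharset) {
        try {
            // Rebuild the percent-encoded value as it appeared in the request body...
            String original = URLEncoder.encode(misdecoded, serverCharset);
            // ...and decode it again with the charset the client intended.
            return URLDecoder.decode(original, clientCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset name", e);
        }
    }
}
```

Under that assumption, a call would look like ParameterRepair.redecode(request.getParameter("name"), "ISO-8859-1", "Shift_JIS").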

A better alternative is to tell the server to use the given charset directly, by calling ServletRequest#setCharacterEncoding() before obtaining any request parameter.

request.setCharacterEncoding("Shift_JIS");
String string = request.getParameter("name");

There is, by the way, no way to know which charset the client used to URL-encode the POST request body. Almost no clients specify it in the Content-Type request header; if they did, the servlet API would already have made the ServletRequest#setCharacterEncoding() call implicitly. You can check this with getCharacterEncoding(): if it returns null, the client specified none.

However, this of course does not work if the client has already properly encoded the value as UTF-8 or in any other charset; the Shift_JIS massage would then break it again. There are tools and APIs that guess the original charset from the received byte sequence, but they are not 100% reliable. If your servlet is part of a public API, you should document that it accepts only UTF-8 encoded parameters whenever the charset is not specified in the request header. You can then move the problem to the client side and point clients to their mistake.
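The "use the declared charset, otherwise fall back to UTF-8" policy can be sketched in plain Java as follows. Inside a servlet you would normally just check getCharacterEncoding(); the helper below (a hypothetical name, not a servlet API) only illustrates the same decision on a raw Content-Type header value.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper illustrating the documented fallback policy.
final class RequestCharset {
    // Returns the charset named in a Content-Type header value, or the
    // fallback when the header is missing, carries no charset parameter,
    // or names a charset this JVM does not know.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With StandardCharsets.UTF_8 as the fallback, a request that declares no charset is treated as UTF-8, which matches what the API documentation would promise.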

Upvotes: 2

Giorgio

Reputation: 5183

Am I correct that what you get is a string that was parsed as if it were UTF-8 but was encoded in Windows-1255? The solution would be to encode your string in UTF-8 and decode the result as Windows-1255.

Upvotes: 0

Andrzej Doyle

Reputation: 103797

I haven't had need to do this before, but I believe that

final String realCharsetName = "Shift_JIS"; // for example
// getBytes() with no argument re-encodes using the platform default charset
String fixed = new String(brokenString.getBytes(), realCharsetName);

stands a good chance of doing the trick.

(This does however assume that encoding issues were entirely ignored while reading, and so the platform's default character set was used (a likely assumption since if people thought about charsets they probably would have got it right). It also assumes you're decoding on a machine with the same default charset as the one that originally read the bytes and created the String.)

If you happen to know exactly which charset was incorrectly used to read the string, you can pass it into the getBytes() call to make this 100% reliable.
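A minimal sketch of that variant, with both charsets passed explicitly, might look like this. The class and method names are hypothetical. Note that recovery is only exact when the wrong decoding was lossless: ISO-8859-1 maps every byte to a character, whereas a UTF-8 decoder replaces invalid byte sequences with U+FFFD, and those original bytes cannot be recovered.

```java
import java.nio.charset.Charset;

// Hypothetical helper for undoing a decode with the wrong charset.
final class MojibakeRepair {
    // wronglyUsed: the charset that was incorrectly used to read the bytes.
    // actual: the charset the bytes were really encoded in.
    static String repair(String broken, Charset wronglyUsed, Charset actual) {
        // Re-encoding with the wrong charset recovers the original bytes
        // (as long as that charset decoded every byte without loss).
        byte[] originalBytes = broken.getBytes(wronglyUsed);
        // Decode those bytes with the charset they were really written in.
        return new String(originalBytes, actual);
    }
}
```

For example, MojibakeRepair.repair(brokenString, StandardCharsets.ISO_8859_1, Charset.forName("Shift_JIS")) would apply if the bytes were read as ISO-8859-1 but sent as Shift_JIS.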

Upvotes: 2
