User2709

Reputation: 573

Weird results with character encodings

Here is the scenario-

We have a string which contains a Spanish character (ú), it is stored in the database using Spring's JDBCTemplate, so essentially JDBC.

At this stage I am a bit confused and have these questions-

What should I be doing to solve this at the application level rather than at the column level?

Any pointers would be helpful.

Upvotes: 3

Views: 1319

Answers (1)

Daniel Martin

Reputation: 23548

The effects you're seeing can all be explained by the assumption that data is written to the database as UTF-8 bytes, but that the database believes that those bytes are some other character set (either ISO-LATIN-1 or Windows-1252), and then when you read the data, the string you get back is those bytes interpreted as ISO-LATIN-1 or a related character set.

The character ú in UTF-8 is the two bytes 0xC3 0xBA. When those bytes are interpreted as ISO-LATIN-1 or win-1252, you get the two characters Ãº.

The two characters Ãº, when written in UTF-8, are the four bytes 0xC3 0x83 0xC2 0xBA. When those four bytes are interpreted as ISO-LATIN-1 (or win-1252), you get the four characters ÃƒÂº.

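(Windows-1252 and ISO-LATIN-1 agree on almost all of the bytes/characters in question; the one exception is 0x83, which is ƒ in Windows-1252 and a non-printing control character in ISO-LATIN-1. From the evidence I can't tell the difference between them.)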
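The byte arithmetic above can be reproduced in a few lines. This is only a sketch: the class and method names (MojibakeDemo, misread, utf8Hex) are made up for illustration, and it uses ISO-8859-1 for the wrong decoding step because that charset is guaranteed to exist in every JVM. The character is written as a Unicode escape so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the string as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Hex dump of the string's UTF-8 bytes, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00FA";        // the single character u-acute
        String once = misread(original);   // two characters: U+00C3 U+00BA
        String twice = misread(once);      // four characters

        System.out.println(utf8Hex(original)); // C3 BA
        System.out.println(utf8Hex(once));     // C3 83 C2 BA
        System.out.println(once.length() + " " + twice.length()); // 2 4
    }
}
```

Each wrong decode doubles the character count, which is why the string length is a useful symptom to look at.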

What's happening to you, I believe, is this:

  1. The JDBC clients are querying your database and are getting back a string containing the two characters Ãº from the database.

  2. When the JVM prints a result to the windows 7 console box, if it is not started with -Dfile.encoding=utf-8, it sends to the console box the bytes needed to represent the string in win-1252. If the JVM is started with that option, it sends to the console box the bytes necessary to represent the string in UTF-8.

  3. Your Windows 7 console box is set to windows-1252, and displays what Java prints by interpreting the bytes Java sends it according to windows-1252.

  4. When you call .getBytes() with no argument, you are using the JVM's default encoding to turn the string into bytes. Therefore, new String(str.getBytes(), "UTF-8") will result in an identical string if the default JVM encoding is UTF-8, and can only result in something actually happening if the default encoding is something different than UTF-8.
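Point 4 can be checked without restarting the JVM by passing the charset explicitly in place of the default. A minimal sketch, assuming a hypothetical helper roundTrip whose second argument stands in for whatever the JVM default happens to be:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Computes what new String(str.getBytes(), "UTF-8") would produce
    // if the JVM default charset were assumedDefault.
    static String roundTrip(String s, Charset assumedDefault) {
        return new String(s.getBytes(assumedDefault), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two-character string that JDBC hands back (U+00C3 U+00BA).
        String fromJdbc = "\u00C3\u00BA";

        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Default UTF-8: encode and decode cancel out, the string is unchanged.
        System.out.println(roundTrip(fromJdbc, StandardCharsets.UTF_8).equals(fromJdbc));

        // Default ISO-8859-1: the bytes C3 BA are decoded as UTF-8, giving u-acute.
        System.out.println(roundTrip(fromJdbc, StandardCharsets.ISO_8859_1).equals("\u00FA"));
    }
}
```

Both comparisons print true, which matches the behaviour described in point 4: the no-argument form only changes anything when the default is not UTF-8.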

This explains all the evidence you presented: the Java string retrieved by JDBC contains the characters Ãº, and when a non-UTF-8 JVM prints this to the console box, it is displayed as Ãº. When a UTF-8 JVM prints this string to the console box, it sends the four bytes 0xC3 0x83 0xC2 0xBA, and the console interprets them as the four characters ÃƒÂº. When a Java web server sends this string back to the browser, the browser sees exactly what the Java application received from JDBC.

The first thing to check is that the Spring JDBCTemplate is receiving the data correctly and writing it to the database correctly. Can you get Spring to log what it receives from the browser, and ensure that the browser is sending UTF-8 and that Spring knows the browser is sending UTF-8? (One thing worth logging is each received string together with its length. If ú arrives as one character, the input was decoded as UTF-8 correctly; if it arrives as two, it was not.)
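For the length check, something along these lines could be called from wherever the request fields are logged. It is a sketch only: StringInspector and describe are hypothetical names, and how you wire it into your own logging is up to you. Printing code points rather than the string itself avoids being fooled by the console encoding.

```java
class StringInspector {
    // Returns the string's length followed by its code points,
    // in a form that is safe to log regardless of console encoding.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder("length=" + s.length());
        s.codePoints().forEach(cp -> sb.append(String.format(" U+%04X", cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\u00FA"));       // correctly decoded input
        System.out.println(describe("\u00C3\u00BA")); // input decoded with the wrong charset
    }
}
```

A correctly decoded ú shows up as length=1 U+00FA, while the misdecoded form shows up as length=2 U+00C3 U+00BA.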

Assuming that data is getting into the database correctly, and as you say that you can't make a change on the database side, and want a change purely from the application side, you can do this to every string received from JDBC:

new String(str.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)

That should transform your string back to what you want, regardless of what the JVM's default encoding is.
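If you apply that conversion to every string, it may be worth wrapping it in a helper. The sketch below is one way to do it; the class name JdbcStringRepair and the two guards are my own assumptions, not something your setup requires. The guards leave a string untouched when it contains characters ISO-8859-1 cannot represent, or when its bytes are not valid UTF-8, so that values which are already correct are not damaged.

```java
import java.nio.charset.StandardCharsets;

class JdbcStringRepair {
    // Re-interprets a string whose chars are really UTF-8 bytes read as ISO-8859-1.
    static String repair(String s) {
        if (s == null) {
            return null;
        }
        // Guard 1 (assumption): chars above 0xFF cannot have come from an
        // ISO-8859-1 decode, so leave such strings alone.
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return s;
            }
        }
        String fixed = new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        // Guard 2 (assumption): U+FFFD in the result means the bytes were not
        // valid UTF-8, so the original was probably fine already.
        return fixed.indexOf('\uFFFD') >= 0 ? s : fixed;
    }

    public static void main(String[] args) {
        System.out.println(repair("\u00C3\u00BA").equals("\u00FA")); // mojibake is repaired
        System.out.println(repair("\u00FA").equals("\u00FA"));       // correct text is kept
    }
}
```

Both lines print true. Note that if the driver is actually decoding as windows-1252 rather than ISO-8859-1, some characters will fall outside the 0xFF range and the first guard will skip them, so check which charset is really in play before relying on this.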

For future reference, running a jvm from the windows command line with -Dfile.encoding=utf-8 usually requires changing the codepage on your console first in order to see stuff correctly. (That can be done with the command chcp 65001. Just remember to use chcp 1252 to change back before running a JVM command without that option set)

Upvotes: 7
