Snap E Tom
Snap E Tom

Reputation: 420

Deciphering MySQL Encoding

I'm having an issue with encoding in MySQL, and I need some help in figuring out what's going on.

First, some parameters. The default encoding of the table is utf8. The character_set_client, character_set_connection, collation_connection, and character_set_server MySQL system variables, though, are all latin1.

I ssh into my MySQL server and I connect to the local server using the local command line client. I select record/column and the string that's returned, let's say the character comes back as A, which is correct. A is represented by hex in UTF-8 as "C5 9F."

However, the PHP app that hits the server interprets it as XY. In the MySQL commandline client, if I send the command "SET NAMES utf8", it will also now display it as XY.

If I do a select INTO OUTFILE and use hexedit to edit the file, I see two hex characters that map to X, then two hex characters that map to Y. ("c3 85" for X and "C5 B8" for Y). Basically, it's taking the two hex values and displaying them indeed as UTF8 characters.

First and foremost, it looks like the database is indeed storing things as UTF8, but the wrong kind of UTF8, correct? Are they going in as raw Unicode, but somehow, maybe because of the sytem variables, it is not being translated to UTF8?

Second, how/why is the MySQL command line client correctly interpreting XY as A?

Finally, to the successful interpretation of the MySQL command line, is there a chart that shows how C3 85 C5 B8 is getting converted to A, or XY is getting converted to A?

Thanks a bunch for any insight.

Upvotes: 1

Views: 203

Answers (1)

deceze
deceze

Reputation: 522042

Your question is kind of confusing, so I'll explain with an example of my own:

You connect to the database without issuing SET NAMES, so the connection is set to Latin-1. That means the database expects any communication between you and it to be encoded in Latin-1.
You send the bytes C3A2 to the database, which you want to mean "â" in the UTF-8 encoding.
The database, expecting Latin-1, is interpreting this as the characters "¢" (C3 and A2 in the Latin-1 encoding).
The database will store these two characters internally in whatever encoding the table is set to.

You connect to the database in a different fashion, running SET NAMES UTF-8. The database now expects to talk to you in UTF-8.
You query the data stored in the database, you receive the characters "¢" encoded in UTF-8 as C382 C2A2, because you told the database to store the characters "¢" and you are now querying them over a UTF-8 connection.

If you connected to the database again using Latin-1 for the connection, the database would give you the characters "¢" encoded in Latin-1, which are the bytes C3 A2. If the client that you used to connect is interpreting that in Latin-1, you'll see the characters "¢". If the client is interpreting that as UTF-8, you'll see the character "â".

Essentially these are the points at which something can screw up:

  • the database will interpret any bytes it receives as characters in whatever encoding is set for the connection and convert the encoding of these characters to match the table they're supposed to be stored in
  • the database will convert the encoding of any characters from the encoding they're stored in into the encoding of the connection when retrieving data
  • the client may or may not interpret the bytes it receives from the database into the right characters to display on screen, especially command line environments aren't always set to correctly display UTF-8 data

Hope that helps.

Upvotes: 2

Related Questions