Reputation: 2548
Recently I exported parts of my MySQL database and noticed that the text had several strange characters in it. For example, the string ’ often appeared.
When trying to find out what this meant, I found the Stack Overflow question "Character Encoding and the ’ Issue". From that question I now know that the string ’ stands for a quote.
But how can I find out more generally what a string of characters stands for? For example, the letter Â often appears in my database as well and is now causing a problem on a certain page. To solve that problem, I would like to know what that character means.
I've looked at several tables showing character encoding, but haven't been able to figure out how to use these tables to see why ’ means ', or, more importantly for me, what Â stands for. I'd be very grateful if someone could point me in the right direction.
Upvotes: 3
Views: 1261
Reputation: 142208
The latin1 encoding for ’ is (in hex) E2 80 99.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client, and the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So your INSERT statement dutifully treated it as 3 latin1 characters, E2 80 99 (visually ’), and converted each one to utf8: hex C3A2 E282AC E284A2.
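The sequence above can be reproduced outside MySQL. This is a minimal Python sketch, not what the server literally runs; it assumes that MySQL's latin1 behaves like Python's `cp1252` codec for these three bytes, and the variable names are only illustrative.

```python
# What the client typed: a right single quotation mark (U+2019).
original = "\u2019"

# Step 1: the client sends it as utf8 bytes.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex())  # e28099

# Step 2: the connection claims latin1, so the three bytes are read
# as three separate characters (cp1252 is assumed to match here).
misread = utf8_bytes.decode("cp1252")
print(misread)  # ’

# Step 3: each of those three characters is converted to utf8 for storage.
double_encoded = misread.encode("utf-8")
print(double_encoded.hex())  # c3a2e282ace284a2
```

The last line matches the hex you pasted, which is why this looks like double encoding rather than data that was simply displayed with the wrong charset.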
Read about "double encoding" in "Trouble with UTF-8 characters; what I see is not what I stored".
Meanwhile, browsers tend to be forgiving about double encoding, or else it might have shown ’.
latin1 characters are each 1 byte (2 hex digits). utf8 characters are 1 to 3 bytes and utf8mb4 characters are 1 to 4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
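To see the byte counts for yourself, here is a small Python sketch. The sample characters are my own choice for illustration; the 4-byte example (an emoji) is the kind of character that needs utf8mb4 in MySQL.

```python
# One sample character for each UTF-8 length: A, ©, ’ and an emoji.
samples = ["A", "\u00a9", "\u2019", "\U0001F600"]

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  {byte_lengths[ch]} byte(s)  {encoded.hex().upper()}")

# In latin1 the copyright sign is a single byte.
print("\u00a9".encode("latin-1").hex().upper())  # A9
```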
As for Â, go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, ©, is utf8 hex C2A9, which is misinterpreted as © when the bytes are read as latin1.
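For the more general question of what a garbled string stands for, one approach is to reverse the misreading: turn the characters back into latin1 bytes and decode those bytes as utf8. The sketch below is only that, a sketch; `undo_mojibake` is a hypothetical helper name, and `cp1252` is assumed to be a close enough stand-in for MySQL's latin1.

```python
def undo_mojibake(text: str) -> str:
    """Reverse one round of 'utf8 bytes read as cp1252' (hypothetical helper)."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or only a fragment of it): leave unchanged.
        return text

print(undo_mojibake("\u00c2\u00a9"))        # Â followed by ©  -> ©
print(undo_mojibake("\u00e2\u20ac\u2122"))  # ’ -> ’
print("\u00c2".encode("cp1252").hex())      # c2, the lead byte behind Â
```

A lone Â cannot be decoded this way because C2 is only the first half of a 2-byte sequence, so look at the character right after it. If that looks like an ordinary space, it may well be a non-breaking space (utf8 hex C2A0).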
Upvotes: 2