Reputation: 5007
My boss likes to use n-dashes. They always cause problems with encoding and I cannot work out why.
I store my TEXT field in a database under the charset: utf8_general_ci
.
I have the following tags under my <head>
on my webpage:
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
I pull the information from my database with the following set:
mysql_set_charset('UTF8',$connection);
(I know MYSQL is depreciated)
But when I get information from the database, I end up with this:
– Europe
If I take this string and run it through utf8_decode
, I get this:
��? Europe
I even tried running it thorugh utf8_encode, and I got this:
âÃâ¬Ãâ Europe
Can someone explain to me why this is happening? I dont understand. I even ran the string through mb_detect_encoding
and It said the string was utf8. So why is not printing correctly?
The solution (or not really a solution, because it ruins the rest of the website) is to remove the mysql_set_encoding
line, and use utf8_decode
. Then it prints out fine. BUT WHY!?
Upvotes: 1
Views: 47
Reputation: 125835
You have to remember that computers handle all forms of data as nothing more than sequences of 1s and 0s. In order to turn those 1s and 0s into something meaningful, the computer must somehow be told how those bits should be interpreted.
When it comes to a textual string, such information regarding its bits' interpretation is known as its character encoding. For example, the bit sequence 111000101000000010010011
, which for brevity I will express in hexadecimal notation as 0xe28093
, is interpreted under the UTF-8 character encoding to be your boss's much-loved U+2013 (EN-DASH); however that same sequence of bits could mean absolutely anything under a different encoding: indeed, under the ISO-8859-1 encoding (for example), it represents a sequence of three characters: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>) and U+0093 (SET TRANSMIT STATE).
Unfortunately, in their infinite wisdom, PHP's developers decided not to keep track of the encoding under which your string variables are stored—that is left to you, the application developer. Worse still, many PHP functions make arbitrary assumptions about the encoding of your variables, and they happily go ahead manipulating your bits without any thought of the consequences.
So, when you call utf8_decode
on a string: it takes whatever bits you provide, works out what characters they happen to represent in UTF-8, and then returns to you those same characters encoded in ISO-8859-1. It's entirely possible to come up with an input sequence that, when passed to this function, produces absolutely any given result; indeed, if you provide as input 0xc3a2c280c293
(which happens to be the UTF-8 encoding of the three characters mentioned above), it will produce a result of 0xe28093
—the UTF-8 encoding of an "en dash"!
Such double encoding (i.e. UTF-8 encoded, treated as ISO-8859-1 and transcoded to UTF-8) appears to be what you're retrieving from MySQL when you do not call mysql_set_charset
(in such circumstances, MySQL transcodes results to whatever character set the client specifies upon connection—the standard drivers use latin1
unless you override their default configuration). In order for a result that MySQL transcodes to latin1
to produce such double encoded UTF-8, the value that is actually stored in your column must have been triple encoded (i.e. UTF-8 encoded, treated as ISO-8859-1, transcoded to UTF-8, then treated as latin1
again)!
You need to fix the data that is stored in your database:
Identify exactly how the incumbent data has actually been encoded. Some values may well be triple-encoded as described above, but others (perhaps that predate particular changes to your application code; or that were inserted/updated from a different source) may be encoded in some other way. I find SELECT HEX(myColumn) FROM myTable WHERE ...
to be very useful for this purpose.
Correct the encodings of those values that are currently incorrect: e.g. UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE ...
—if an entire column is misencoded, you can instead use ALTER TABLE
to change it to a binary string type and then back to a character string of the correct encoding. Beware of transformations that increase the encoded length, as the result might overflow your existing column size.
Upvotes: 1