Reputation: 9008
I am struggling to understand character encoding in PHP.
Consider the following script:
$string = "\xe2\x82\xac";
var_dump(mb_internal_encoding());
var_dump($string);
var_dump(unpack('C*', $string));
$utf8string = mb_convert_encoding($string, "UTF-8");
var_dump($utf8string);
var_dump(unpack('C*', $utf8string));
mb_internal_encoding("UTF-8");
var_dump($string);
var_dump($utf8string);
I have a string, actually the € character, represented with its unicode code points. Up to PHP 5.5 the internal encoding used is ISO-8859-1, hence I think that my string will be encoded using this encoding. With unpack I can see the byte representation of my string, and it corresponds to the hexadecimal codes I use to define the string.
Then I convert the encoding of the string to UTF-8, using mb_convert_encoding. At this point the string displays differently on the screen and its byte representation changes (and this is expected).
If I also change the PHP internal encoding to UTF-8, I'd expect $utf8string to be displayed correctly on the screen, but this doesn't happen.
What am I missing?
Upvotes: 1
Views: 1594
Reputation: 72186
You started with a string that is the UTF-8 representation of the Euro symbol. If you run echo($string), all versions of PHP produce the three bytes you put in $string. How they are interpreted by the browser depends on the character set specified in the Content-Type header. If it is text/html; charset=utf-8 then you get the Euro sign in the rendered page.
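A minimal sketch of that point, assuming the script is served through a web server (on the command line the header call has no visible effect). The variable name $euro is only illustrative; the header value is the one mentioned above.

```php
<?php
// The UTF-8 encoding of the Euro sign, written as raw bytes.
$euro = "\xe2\x82\xac";

// Tell the browser how to interpret the bytes that follow.
// Guarded so the call is skipped if output has already started.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// PHP emits the three bytes unchanged; the browser does the decoding.
echo $euro, "\n";
```

With a different charset in the header, the same three bytes would be rendered as three separate characters instead.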
Then you make the wrong move. You call mb_convert_encoding() with only two arguments. This lets PHP use the current internal encoding of the mbstring extension for the third argument ($from_encoding). Why does that matter?
For PHP 5.6 and newer, the default value returned by mb_internal_encoding() is UTF-8 and the call to mb_convert_encoding() is a no-op.
But for previous versions of PHP, the default value returned by mb_internal_encoding() is ISO-8859-1 and it doesn't match the encoding of your string. Accordingly, mb_convert_encoding() interprets the bytes of $string as three individual characters and encodes them using the rules of UTF-8. The outcome is obviously wrong.
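The two cases can be reproduced on any PHP version by passing the source encoding explicitly instead of relying on the default. This is only a sketch; the variable names $fromUtf8 and $fromLatin1 are illustrative and not part of the original script.

```php
<?php
// The three bytes from the question: the UTF-8 encoding of the Euro sign.
$string = "\xe2\x82\xac";

// Source encoding matches the data: the bytes come back unchanged.
$fromUtf8 = mb_convert_encoding($string, 'UTF-8', 'UTF-8');

// Source encoding wrongly declared as ISO-8859-1: each byte is treated
// as a separate character and re-encoded as a two-byte UTF-8 sequence.
$fromLatin1 = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromUtf8), "\n";   // e282ac
echo bin2hex($fromLatin1), "\n"; // c3a2c282c2ac
```

Passing the third argument explicitly also makes the script behave the same regardless of the configured internal encoding.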
Btw, if you initialize $string with '€' (and the script file is saved as UTF-8) you get the same output on all PHP versions (even on PHP 4, iirc).
Upvotes: 1
Reputation: 50190
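The script you show doesn't use any non-ASCII characters, so the encoding of the script file does not make any difference. mb_internal_encoding() does not re-encode existing strings; it only sets the encoding that mbstring functions assume by default, and it affects output only when an output handler such as mb_output_handler is in use. This question will tell you more about how it works; it will also tell you it's better not to use it.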
The three-byte string $string in your code is the UTF-8 representation of the Euro symbol, not its "unicode code point" (which is U+20AC, a number that fits in two bytes, like the code points of all common Unicode characters).
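One way to see the difference between the UTF-8 bytes and the code point is to convert the string to UTF-16BE, where characters in this range are stored as their code point value. A small sketch, with illustrative variable names:

```php
<?php
// UTF-8 encoding of the Euro sign: three bytes.
$euro = "\xe2\x82\xac";
echo strlen($euro), "\n"; // 3 (strlen counts bytes)

// One character when the string is read as UTF-8.
echo mb_strlen($euro, 'UTF-8'), "\n"; // 1

// In UTF-16BE the same character is the two bytes 0x20 0xAC,
// which is the code point U+20AC.
$utf16 = mb_convert_encoding($euro, 'UTF-16BE', 'UTF-8');
echo bin2hex($utf16), "\n"; // 20ac
```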
Does this clear up the behavior you see?
Upvotes: 2