Maarten Bodewes

Reputation: 93958

What is the character set if default_charset is empty

From PHP 5.6 onwards, the default_charset setting defaults to "UTF-8", as explained, for example, in the php.ini documentation. The documentation also says that the value is empty in earlier versions.

As I am creating a Java library to communicate with PHP, I need to know which values I should expect when a string is handled as bytes internally. What happens if the default_charset string is empty and a (literal) string contains characters outside the range of ASCII? Should I expect the default character encoding of the platform, or the character encoding used for the source file?

Upvotes: 4

Views: 7485

Answers (2)

Giedrius D

Reputation: 1263

Short answer

For literal strings, it is always the source file encoding. The default_charset value has no effect here.

Longer answer

PHP strings are "binary safe", meaning they do not have any internal string encoding. Basically, strings in PHP are just buffers of bytes.

For literal strings, e.g. $s = "Ä", this means that the string will contain whatever bytes were saved in the file between the quotes. If the file was saved in UTF-8, this is equivalent to $s = "\xc3\x84"; if the file was saved in ISO-8859-1 (latin1), this is equivalent to $s = "\xc4".

Setting the default_charset value does not affect the bytes stored in strings in any way.
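A minimal sketch of this point. It writes the bytes with escape sequences rather than a literal "Ä", so the result does not depend on how the snippet itself is saved; the variable names are only illustrative.

```php
<?php
// The same character "Ä" as two differently encoded source files would store it.
$utf8   = "\xc3\x84"; // bytes of "Ä" in a UTF-8 encoded file
$latin1 = "\xc4";     // bytes of "Ä" in an ISO-8859-1 encoded file

// bin2hex() shows the raw bytes; strlen() counts bytes, not characters.
echo bin2hex($utf8), ' ', strlen($utf8), "\n";     // c384 2
echo bin2hex($latin1), ' ', strlen($latin1), "\n"; // c4 1
```

A consumer on the other side of the wire (such as a Java library) would see exactly these bytes and has to know which encoding produced them.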

What does default_charset do then?

Some functions that have to treat strings as text and are encoding-aware accept $encoding as an argument (usually optional). This tells the function which encoding the text in the string uses.

Before PHP 5.6, the default values of these optional $encoding arguments were either fixed in the function definition (e.g. htmlspecialchars()) or configurable through separate php.ini settings for each extension (e.g. mbstring.internal_encoding, iconv.input_encoding).

In PHP 5.6 the php.ini setting default_charset was given the default value "UTF-8" and a broader role. The old per-extension settings were deprecated, and all functions that accept an optional $encoding argument should now default to the default_charset value when the encoding is not specified explicitly.
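As an illustration, htmlspecialchars() is one of the encoding-aware functions mentioned above. The sketch below passes the encoding explicitly, so it behaves the same regardless of php.ini; the sample byte and the variable names are assumptions made for the example.

```php
<?php
// "Ä" encoded as ISO-8859-1: a single byte that is not valid UTF-8 on its own.
$latin1 = "\xc4";

// The value the text above says is used when $encoding is omitted (PHP 5.6+).
// What it contains depends on the local configuration.
echo 'default_charset: ', var_export(ini_get('default_charset'), true), "\n";

// Same bytes, two different claims about their encoding.
$asLatin1 = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');
$asUtf8   = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

echo bin2hex($asLatin1), "\n"; // c4: the byte passes through unchanged
var_dump($asUtf8 === '');      // bool(true): invalid UTF-8 input yields an empty string
```

The input bytes never change; only the function's interpretation of them does.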

However, the developer remains responsible for making sure that the text in a string is actually encoded in the encoding that was specified.
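One way to check this for UTF-8, sketched under the assumption that only core functions are available: with the /u modifier, preg_match() returns false when the subject is not well-formed UTF-8. The helper name looksLikeUtf8() is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 byte sequence.
// The empty pattern matches any valid subject, so the only way to fail is
// the UTF-8 validation that the /u modifier triggers.
function looksLikeUtf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(looksLikeUtf8("\xc3\x84")); // bool(true):  "Ä" as UTF-8
var_dump(looksLikeUtf8("\xc4"));     // bool(false): "Ä" as ISO-8859-1
```

This only tells you that the bytes could be UTF-8. It cannot prove which encoding the author of the source file intended.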


Upvotes: 8

oshell

Reputation: 9103

It seems you should not rely on the internal encoding. The internal character encoding can be read or set with mb_internal_encoding().

Example phpinfo() output:

  • PHP Version 5.5.9-1ubuntu4.5
  • default_charset no value

file1.php

<?php
$string = "e";
echo mb_internal_encoding(); //ISO-8859-1

file2.php

<?php
$string = "É";
echo mb_internal_encoding(); //ISO-8859-1

Both files output ISO-8859-1 unless you change the internal encoding manually.

<?php
echo bin2hex("ö"); //c3b6 (utf-8)

The hex dump of this character shows its UTF-8 encoding. If you save the file as UTF-8, the string in this example contains 2 bytes, even though the internal encoding is not set to UTF-8. Therefore, the encoding you should expect is the character encoding used for the source file.

Upvotes: 2
