Reputation: 677
The PHP documentation says:
Of course, in order to be useful, functions that operate on text may have to make some assumptions about how the string is encoded. Unfortunately, there is much variation on this matter throughout PHP’s functions:
[... a few special cases are described ...]
Ultimately, this means writing correct programs using Unicode depends on carefully avoiding functions that will not work and that most likely will corrupt the data [...]
Source: https://www.php.net/manual/en/language.types.string.php
So naturally my question is: where are the specifications that let us identify the encoding/charset associated with string arguments, return values, constants, array keys/values, etc. for built-in functions, methods, and data (e.g. array_key_exists, DOMDocument::getElementsByTagName, DateTime::format, $_GET[$key], ini_set, PDO::__construct, json_decode, Exception::getMessage() and many more)? How do Composer package providers specify the encodings in which they accept or provide textual data?
I have been working roughly with the following heuristic: (1) never change the encoding of anything, (2) when forced to pick an encoding, pick UTF-8. This has been working for years but it feels very unsatisfactory.
Whenever I try to find an answer to this question, I only get search results about URL encoding, HTML entities, or how string literals are interpreted (according to the source file's encoding).
Upvotes: 2
Views: 359
Reputation: 522032
Strings in PHP are what other languages would call byte arrays: a raw sequence of bytes. PHP is generally not interested in what characters those bytes represent. Only functions that work with strings at the character level need to be aware of the encoding; everything else treats the string as opaque bytes.
For example, array_key_exists doesn't need to know anything about characters to determine whether the array contains a key with the same bytes as the given string.
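A minimal sketch of that byte-level comparison, assuming nothing beyond core PHP. The variable names are made up for illustration; the two strings are the UTF-8 and ISO-8859-1 byte sequences for "é", written as escapes so the result does not depend on how the source file is saved:

```php
<?php
// "é" encoded two different ways.
$utf8   = "\xC3\xA9"; // UTF-8: two bytes
$latin1 = "\xE9";     // ISO-8859-1: one byte

$data = [$utf8 => 'stored under the UTF-8 bytes'];

// The lookup compares bytes; no encoding is consulted at any point.
$foundUtf8   = array_key_exists($utf8, $data);   // same bytes, so the key is found
$foundLatin1 = array_key_exists($latin1, $data); // different bytes, so it is not

var_dump($foundUtf8, $foundLatin1);
```

To PHP these are simply two different keys, even though a human would read both as the same character.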
However, mb_strlen, for example, explicitly tells you how many characters the string consists of, so it needs to interpret the given string in a specific encoding to give you the right number. mb_strlen('漢字', 'latin1') and mb_strlen('漢字', 'utf-8') give very different results: if the source file is saved as UTF-8, the first returns 6 and the second returns 2. There is no unified way in which these kinds of functions are made encoding-aware*; you will need to consult their manual entries.
* The mb_ functions in particular generally default to mb_internal_encoding() when no encoding argument is passed, but other sets of functions don't.
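To make the difference concrete, here is a small sketch, assuming the mbstring extension is loaded. The string is given as the UTF-8 bytes of "漢字" in escape form, so the outcome does not depend on the encoding of the source file:

```php
<?php
// UTF-8 bytes for "漢字" (U+6F22, U+5B57), three bytes per character.
$s = "\xE6\xBC\xA2\xE5\xAD\x97";

$bytes      = strlen($s);                  // counts bytes, no encoding involved
$charsUtf8  = mb_strlen($s, 'UTF-8');      // interprets the bytes as UTF-8
$charsLatin = mb_strlen($s, 'ISO-8859-1'); // one byte per character

var_dump($bytes, $charsUtf8, $charsLatin); // int(6), int(2), int(6)
```

Passing the encoding explicitly, as above, avoids depending on whatever mb_internal_encoding() happens to be set to.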
Functions like DateTime::format look for specific characters in the format string to replace with date values, e.g. d for the day, m for the month, etc. You can generally assume that they are looking for ASCII byte values unless specified otherwise (and I'm not aware of anything that specifies otherwise). So any ASCII-compatible encoding will usually do.
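As an illustration (a sketch, not a statement about how the function is specified), a format string can mix ASCII format characters with UTF-8 text. It relies on the observed behaviour that bytes which are not format characters are copied through unchanged. The date and the variable names are arbitrary, and the non-ASCII parts are the UTF-8 bytes of "年", "月" and "日":

```php
<?php
$date = new DateTimeImmutable('2024-03-05 00:00:00', new DateTimeZone('UTC'));

// ASCII format characters (Y, m, d) interleaved with UTF-8 encoded text.
$format = 'Y' . "\xE5\xB9\xB4"   // 年
        . 'm' . "\xE6\x9C\x88"   // 月
        . 'd' . "\xE6\x97\xA5";  // 日

// Only the ASCII bytes Y, m and d are replaced by date values.
$formatted = $date->format($format);

echo $formatted, PHP_EOL; // 2024年03月05日 when the terminal displays UTF-8
```

This works because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never be mistaken for an ASCII format character. An encoding that is not ASCII-compatible, such as UTF-16, offers no such guarantee.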
For a lot more details, you may be interested in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
Upvotes: 1