Reputation: 2184
When the browser sends data in the body of a POST request (i.e. the name=value pairs from form elements), how does PHP determine the character encoding so it can properly decode the byte stream into characters for its own internal use?
I can understand for some tasks where PHP won't need to decode, e.g. for SQL INSERT queries, it may simply pass the data/string along to the DBMS with no additional processing.
But for text processing/regex operations, I imagine PHP will need to decode the byte stream into characters before it can perform tests, pattern matches, etc. on them.
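To make the concern concrete, here is a minimal sketch of how byte-oriented and character-oriented operations can disagree on the same input. It assumes the input is UTF-8; the variable names are only for illustration.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$input = "\xC3\xA9";

// strlen() counts bytes, not characters.
$byteLength = strlen($input); // 2

// Without the /u modifier, PCRE treats the subject as single bytes,
// so "exactly one character" does not match a two-byte string.
$matchesAsBytes = preg_match('/^.$/', $input); // 0

// With /u, pattern and subject are treated as UTF-8,
// so the same two bytes count as one character.
$matchesAsUtf8 = preg_match('/^.$/u', $input); // 1
```

In this sketch PHP never decodes the string on its own; each function is told, or assumes, how to interpret the bytes.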
Also, it seems that because the encoding is determined by the browser, PHP will need guidance from the browser on what charset it used to encode the POST data.
Expecting this guidance would be in the request headers, I set up a text form with
<meta charset="utf-8">
in the head of the webpage containing the form. After entering some values and submitting the form, the request headers contain no obvious information about how the browser encoded the POST data:
POST /experiments/foo.php HTTP/1.1
Host: localhost
Connection: keep-alive
Content-Length: 57
Pragma: no-cache
Cache-Control: no-cache
Origin: http://localhost
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://localhost/experiments/how_does_php_encode_data_it_receives_from_browser.php
Accept-Encoding: gzip, deflate
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6
Or is there something else going on? e.g. is the browser expected to encode characters to some pre-determined standard?
How does PHP know how to decode data it receives from the browser POST requests?
Upvotes: 5
Views: 3288
Reputation: 1556
From PHP.net - Description of core php.ini directives:
default_charset string
In PHP 5.6 onwards, "UTF-8" is the default value and its value is used as the default character encoding for htmlentities(), html_entity_decode() and htmlspecialchars() if the encoding parameter is omitted. The value of default_charset will also be used to set the default character set for iconv functions if the iconv.input_encoding, iconv.output_encoding and iconv.internal_encoding configuration options are unset, and for mbstring functions if the mbstring.http_input mbstring.http_output mbstring.internal_encoding configuration option is unset.
All versions of PHP will use this value as the charset within the default Content-Type header sent by PHP if the header isn't overridden by a call to header().
Example:
Content-Type: text/html; charset=UTF-8
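As a rough sketch of how that default header is derived from the ini setting, the snippet below rebuilds the header string by hand. contentTypeHeader() is a hypothetical helper written for illustration, not a PHP built-in.

```php
<?php
// Build the Content-Type header PHP would send by default for HTML,
// based on the default_charset ini setting.
// contentTypeHeader() is a hypothetical helper, not a PHP built-in.
function contentTypeHeader(string $mimeType = 'text/html'): string
{
    $charset = (string) ini_get('default_charset');

    // An empty default_charset means no charset parameter is appended.
    return $charset === ''
        ? "Content-Type: $mimeType"
        : "Content-Type: $mimeType; charset=$charset";
}

// Overriding the default explicitly, before any output is sent:
// header(contentTypeHeader());
```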
The <meta charset="utf-8"> tag is only useful on responses that don't have this header. Because the Content-Type header takes precedence over the meta tag, and PHP adds this header by default, the value of the meta tag's charset attribute is ignored.
When you submit a form with method=POST, the browser URL-encodes the name-value pairs in the declared charset and adds them to the body of the POST request (with method=GET they go into the query string instead). PHP then URL-decodes them and adds them to the $_POST array, still in the declared charset. (Usually this will be UTF-8.)
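The decoding step can be sketched with parse_str(), which applies much the same percent-decoding that PHP uses to fill $_POST. The request body and the isValidUtf8() helper below are made up for illustration.

```php
<?php
// A hypothetical urlencoded request body. The first value was encoded
// by the browser as UTF-8, the second as ISO-8859-1.
$body = 'name=caf%C3%A9&city=Z%FCrich';

// parse_str() percent-decodes the pairs; it does not convert
// between character encodings.
parse_str($body, $fields);

// The decoded value is simply the bytes the browser sent.
$nameHex = bin2hex($fields['name']); // "636166c3a9"

// isValidUtf8() is a hypothetical helper: preg_match() with the /u
// modifier returns false when the subject is not valid UTF-8.
function isValidUtf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

$nameIsUtf8 = isValidUtf8($fields['name']); // true
$cityIsUtf8 = isValidUtf8($fields['city']); // false: 0xFC is not valid in UTF-8
```

Both values end up in the array unchanged, so any interpretation of the bytes is left to the code that uses them.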
PHP's internal functions work based on the settings in php.ini. For example, if default_charset is set to UTF-8, then functions like htmlspecialchars will return an empty string when passed a string containing any invalid UTF-8 byte sequences. From PHP.net:
The converted string
If the input string contains an invalid code unit sequence within the given encoding an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flags are set.
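A small sketch of that behaviour, assuming an ISO-8859-1 string is passed where UTF-8 is expected. The flags are given explicitly because the default flags are not the same in every PHP version.

```php
<?php
// 0xE9 is "é" in ISO-8859-1, but an invalid byte sequence in UTF-8.
$latin1 = "caf\xE9";

// Without ENT_IGNORE or ENT_SUBSTITUTE, invalid input yields "".
$strict = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the invalid sequence is replaced by U+FFFD,
// the Unicode replacement character (bytes EF BF BD in UTF-8).
$lenient = htmlspecialchars($latin1, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
```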
Upvotes: 1
Reputation: 2184
In regard to GET data, the W3C HTML 4.01 specification states:
Note. The "get" method restricts form data set values to ASCII characters.
Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.
So with GET the browser seems to be locked into ASCII, whereas if the form element has the attribute enctype="multipart/form-data", the standard seems to support the larger [ISO10646] character set.
And I guess that, because it is closer to a pure byte stream, the default Content-Type of application/x-www-form-urlencoded supports all character encodings. In particular, this article states:
http://www.herongyang.com/PHP/Non-ASCII-Form-Basic-Rules.html
URL encoding converts all non ASCII bytes in the form of "%xx", "xx" is the HEX value of the byte.
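A minimal sketch of what that rule means in practice: the "%xx" output depends on which encoding the text was in before it was URL-encoded. The byte values below are the standard encodings of "é"; the variable names are illustrative.

```php
<?php
// The same character "é" as bytes in two different encodings.
$utf8   = "\xC3\xA9"; // UTF-8: two bytes
$latin1 = "\xE9";     // ISO-8859-1: one byte

// Percent-encoding works byte by byte, so the result depends on the
// encoding the text was in before it was URL-encoded.
$encodedUtf8   = rawurlencode($utf8);   // "%C3%A9"
$encodedLatin1 = rawurlencode($latin1); // "%E9"

// Decoding gives back the original bytes, with no record of which
// encoding they were in.
$decoded = rawurldecode($encodedLatin1); // "\xE9"
```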
So this seems to explain what charsets the browser can possibly send, but not how it tells PHP which charset it actually sent (with the exception of GET, which PHP will know can only be ASCII).
Otherwise, from what I can understand, there is essentially no direct guidance from the browser as to the character encoding of the form data it's sending.
I could be wrong though and would be interested in any feedback/alternatives to this theory.
As far as I can tell, the integrity of the scheme essentially relies on the server simply "remembering" which
<meta charset="utf-8">
or
<form ... accept-charset="utf-8">
values it was sending to users (and hoping users didn't change the character encoding via browser "settings") and expecting that the browser will faithfully send subsequent requests in that charset.
So in other words, if you had a web designer on your team responsible for HTML and they set the HTML meta tag <meta charset="utf-8">, they would need to tell the database admin to set up the database schema, tables, etc. to expect UTF-8 encoding.
This is because the server-side devs/DBAs won't be able to dynamically check the encoding (e.g. if a form submission came from a user in a different country, whose browser may be set to some different charset) and potentially reject the request, log a warning, etc.
Basically, it seems the devs need to explicitly set the charset for every HTML page containing forms, e.g. with <meta charset="utf-8">
and then just trust that the browser will send the POST data in the same charset that the HTML containing the form was encoded in.
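Even if the encoding itself can't be detected, one partial safeguard might be to check that the submitted values are at least valid UTF-8 before using them. The sketch below is only an illustration under that assumption; allFieldsAreUtf8() is a hypothetical helper, and passing the check does not prove the browser actually used UTF-8.

```php
<?php
// allFieldsAreUtf8() is a hypothetical helper. It checks validity only:
// it cannot tell which encoding the browser actually used.
function allFieldsAreUtf8(array $fields): bool
{
    foreach ($fields as $value) {
        if (is_array($value)) {
            // Fields such as name="tags[]" arrive as nested arrays.
            if (!allFieldsAreUtf8($value)) {
                return false;
            }
        } elseif (preg_match('//u', (string) $value) !== 1) {
            // preg_match() with /u returns false on invalid UTF-8.
            return false;
        }
    }
    return true;
}

// Possible use at the top of a form handler:
// if (!allFieldsAreUtf8($_POST)) {
//     http_response_code(400);
//     exit('Request body is not valid UTF-8');
// }
```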
Upvotes: 2