Reputation: 723
EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode, which by default escapes non-ASCII characters as \uXXXX Unicode code point sequences. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.
I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API appears to return a different character encoding, even though the value PHP receives from the database is proper UTF-8.
I've managed to reproduce what I'm seeing with a couple of lines of PHP (5.3.24):
<?php
$val = array("Millán");
print json_encode($val)."\n";
According to the PHP documentation, string literals are "encoded ... in whatever fashion [they are] encoded in the script file".
Here is a hex dump of the relevant line of the file (UTF-8 lowercase a-acute = c3 a1):
$ grep ill test.php | od -An -t x1c
24 76 61 6c 20 3d 20 61 72 72 61 79 28 22 4d 69
$ v a l = a r r a y ( " M i
6c 6c c3 a1 6e 22 29 3b 0a
l l 303 241 n " ) ; \n
And here is the output from PHP:
$ php -f test.php | od -An -t x1c
5b 22 4d 69 6c 6c 5c 75 30 30 65 31 6e 22 5d 0a
[ " M i l l \ u 0 0 e 1 n " ] \n
The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.
How can I keep PHP/json_encode from switching the encoding of this variable?
EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.
Upvotes: -1
Views: 4637
Reputation: 522522
This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. These sequences do not refer to the bytes of any particular UTF encoding; they reference the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
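As a quick check of that claim, here is a minimal sketch in Python (chosen only for illustration; the variable names are made up for this example). It parses the exact output from the question alongside the same array written with a literal á, and compares the results:

```python
import json

# The JSON text json_encode printed in the question (escaped form).
escaped = '["Mill\\u00e1n"]'
# The same array with the character written out literally.
literal = '["Millán"]'

decoded_escaped = json.loads(escaped)
decoded_literal = json.loads(literal)

# Both spellings decode to the same string.
print(decoded_escaped == decoded_literal)

# Re-encoding the decoded string as UTF-8 gives back the c3 a1 bytes
# seen in the hex dump of the script file.
utf8_bytes = decoded_escaped[0].encode("utf-8")
print(utf8_bytes.hex())
```

The escape is just an alternative spelling within the JSON text, so a consumer that decodes the JSON before touching the string should see the same value either way.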
Upvotes: 1
Reputation: 109
Try the code below to solve this problem.
<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);
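Note: pass the JSON_UNESCAPED_UNICODE flag as the second argument to json_encode to keep non-ASCII characters unescaped. This flag was added in PHP 5.4, so it is not available on the PHP 5.3.24 used in the question.
For Python, see the question "Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence".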
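For the Python side, a minimal sketch of the equivalent switch is the ensure_ascii parameter of json.dumps (the variable names here are only for illustration):

```python
import json

val = ["Millán"]

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(val)

# ensure_ascii=False keeps the characters as they are in the output string.
unescaped = json.dumps(val, ensure_ascii=False)

print(escaped)
print(unescaped)
```

Both outputs are valid JSON for the same data; the second one has to be written out in an encoding that can represent á, such as UTF-8.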
Upvotes: 0