Reputation: 4135
I have a black box that spits out JSON, and the file contains what I assume are escaped Unicode characters. Here's a snippet:
{
"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."
}
Now, here's how the resulting JSON should actually look to any reasonable human being:
{
"AR_DESCRI":"LIMA CENTIMETRADA/FORMAS UÑAS 100/180 MANI."
}
The most important thing there is that \u00c3\u2018
should equal the Ñ
character.
However, as you can check with any Unicode escape sequence decoder, this is not the case. The output for \u00c3\u2018
is actually Ã‘
which is basically random noise.
I've tried some online decoders and I've also used the json_decode()
PHP function, since PHP is the environment I'm currently working in. Both give me the same results. Here's the snippet of code if you are curious:
<?php
$json = '{"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."}';
print_r(json_decode($json));
// Output: stdClass Object ( [AR_DESCRI] => LIMA CENTIMETRADA/FORMAS UÃ‘AS 100/180 MANI. )
So my question is: why on earth does this happen? Is it an encoding issue on the black box's side, or am I using the wrong function?
Thanks in advance.
Upvotes: 1
Views: 596
Reputation: 32232
Ñ
is U+00D1
represented in UTF-8 as the literal bytes \xc3\x91
.
What you've got there is mojibake caused by incorrectly forcing a cp1252-to-UTF-8 conversion on an input string that was already UTF-8. In cp1252, \xc3
is Ã
and \x91
is ‘
(the left single quotation mark, U+2018).
These two characters are then written out as their Unicode escapes, which is the \u00c3\u2018
you see.
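The corruption itself can be reproduced in a few lines. This is a minimal sketch, assuming the mbstring extension is loaded; the variable names are mine, and the actual producer may use a different function to the same effect:

```php
<?php
// The correct UTF-8 encoding of "Ñ" (U+00D1) is the two bytes C3 91.
$original = "\xC3\x91";

// The mistake: treat those bytes as Windows-1252 text and convert them to UTF-8.
// Byte C3 becomes "Ã" (U+00C3) and byte 91 becomes "‘" (U+2018).
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// json_encode() escapes non-ASCII characters as \uXXXX by default.
echo json_encode($mojibake), "\n"; // prints "\u00c3\u2018"
```

The escapes printed here match the ones in the question, which is what points at a double encoding step somewhere before the JSON is serialized.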
Proof:
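// Render each character of a UTF-8 string as its U+XXXX code point.
function ordify($str) {
    return implode(' ', array_map(
        function($a){return sprintf('U+%04x', mb_ord($a));},
        // A limit of -1 means "no limit" (avoids passing null to an int parameter).
        preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
    ));
}
$borked = 'Ã‘';
// Reverse the bad conversion: encode the characters back to cp1252 bytes,
// which are the original UTF-8 bytes of the intended character.
$fixed = mb_convert_encoding($borked, 'cp1252', 'utf-8');
var_dump(
    $borked, ordify($borked),
    $fixed, ordify($fixed)
);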
Output:
string(5) "Ã‘"
string(13) "U+00c3 U+2018"
string(2) "Ñ"
string(6) "U+00d1"
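To confirm the diagnosis against the JSON from the question, the same reverse conversion can be applied to the decoded value. This is only a sketch for checking the theory, assuming mbstring is available and that every character in the damaged field exists in cp1252; `$data` and `$repaired` are names I made up:

```php
<?php
$json = '{"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."}';
$data = json_decode($json, true);

// Map each character back to its cp1252 byte; the resulting bytes are
// the UTF-8 text that existed before the bad conversion.
$repaired = mb_convert_encoding($data['AR_DESCRI'], 'Windows-1252', 'UTF-8');

echo $repaired, "\n"; // prints LIMA CENTIMETRADA/FORMAS UÑAS 100/180 MANI.

// Sanity check: the result should be well-formed UTF-8.
var_dump(mb_check_encoding($repaired, 'UTF-8')); // bool(true)
```

Note that this round trip is lossy in general: characters with no cp1252 equivalent are replaced during the conversion, and running it on a string that was never double-encoded would damage any non-ASCII text in it.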
So go fix the thing that's generating your JSON, because any reasonable human being should value producing valid data in the first place over kludging in a bandaid solution.
Upvotes: 0