Unexpected results when decoding Unicode Escape Sequences

I have a black box that spits out a JSON file, and this file comes with what I assume are escaped Unicode characters. Here's a snippet:

{
    "AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."
}

Now, here's what the resulting JSON should actually look like to any reasonable human being:

{
    "AR_DESCRI":"LIMA CENTIMETRADA/FORMAS UÑAS 100/180 MANI."
}

The most important thing there is that \u00c3\u2018 should equal the Ñ character.

However, as you can check with any Unicode escape sequence decoder, this is not the case. The output for \u00c3\u2018 is actually Ã‘, which is basically random noise.

I've tried some online decoders and I've also used PHP's json_decode() function, since PHP is the environment I'm currently working in. Both give me the same results. Here's the snippet of code if you are curious:

<?php
$json = '{"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."}';
print_r(json_decode($json));

//Output: stdClass Object ( [AR_DESCRI] => LIMA CENTIMETRADA/FORMAS UÃ‘AS 100/180 MANI. )

So my question is: why on earth does this happen? Is it an encoding issue on the black box's side? Am I using the wrong function?

Thanks in advance.

Upvotes: 1

Views: 596

Answers (1)

Sammitch

Reputation: 32232

Ñ is U+00D1, which UTF-8 encodes as the two bytes \xc3\x91.

What you've got there is mojibake, caused by incorrectly forcing a cp1252-to-UTF-8 conversion on an input string that was already UTF-8. In cp1252, \xc3 is Ã and \x91 is ‘ (left single quotation mark).

Those two characters are U+00C3 and U+2018 in Unicode, and they are then written out as the JSON escapes \u00c3\u2018 you see.
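To make the mechanism concrete, here is a minimal sketch that reproduces the corruption in the forward direction. It assumes the black box does something equivalent to the mb_convert_encoding() call below before encoding to JSON; the variable names are illustrative, and the actual generator may reach the same result by a different route.

```php
<?php
// "Ñ" (U+00D1) written out as its two UTF-8 bytes.
$original = "\xC3\x91";

// The mistake: treat those bytes as cp1252 text and convert them to UTF-8.
// Each byte becomes its own character: \xC3 -> U+00C3, \x91 -> U+2018.
$mangled = mb_convert_encoding($original, 'UTF-8', 'CP1252');

// json_encode() escapes non-ASCII characters as \uXXXX by default.
$json = json_encode($mangled);

echo $json, PHP_EOL; // "\u00c3\u2018"
```

The escapes printed here match the ones in the question's JSON, which is what points to a double encoding on the producing side rather than a problem with json_decode().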

Proof:

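// List the Unicode code point of every character in a UTF-8 string.
function ordify($str) {
    return implode(' ', array_map(
        function($a){return sprintf('U+%04x', mb_ord($a));},
        // -1 means "no limit"; passing null here is deprecated as of PHP 8.1.
        preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)
    ));
}

$borked = 'Ã‘';
// Map each character back to its cp1252 byte, recovering the original UTF-8 bytes.
$fixed  = mb_convert_encoding($borked, 'cp1252', 'utf-8');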

var_dump(
    $borked, ordify($borked),
    $fixed,  ordify($fixed)
);

Output:

string(5) "Ã‘"
string(13) "U+00c3 U+2018"
string(2) "Ñ"
string(6) "U+00d1"

So go fix the thing that's generating your JSON, because any reasonable human being should value producing valid data in the first place over kludging in a bandaid solution.
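If the generator cannot be fixed right away, the conversion from the proof can be wrapped as a stopgap on the consuming side. This is only a sketch of that bandaid: repair_mojibake() is a hypothetical helper name, and it assumes the damage is exactly one round of UTF-8 bytes misread as cp1252.

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 bytes misread as cp1252".
function repair_mojibake(string $s): string {
    // Map each character back to the single cp1252 byte it came from.
    $candidate = mb_convert_encoding($s, 'CP1252', 'UTF-8');

    // Accept the result only if the mapping was lossless (converting forward
    // again reproduces the input) and the recovered bytes are valid UTF-8.
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'CP1252') === $s;
    if ($lossless && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $s; // Leave anything else untouched.
}

$json = '{"AR_DESCRI":"LIMA CENTIMETRADA\/FORMAS U\u00c3\u2018AS 100\/180 MANI."}';
$data = json_decode($json, true);
$data['AR_DESCRI'] = repair_mojibake($data['AR_DESCRI']);

echo $data['AR_DESCRI'], PHP_EOL;
```

With the question's sample this prints LIMA CENTIMETRADA/FORMAS UÑAS 100/180 MANI. The guard leaves correctly encoded text alone, but a string that legitimately contains a sequence like Ã‘ would still be rewritten, which is one more reason to treat this as temporary.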

Upvotes: 0
