mpen

Reputation: 282915

Unicode encode string

I'm json_encoding some strings. Sometimes they contain binary data, which causes the encoding to fail with the error code JSON_ERROR_UTF8. Running the strings through utf8_encode gets around this error. However, ✓ (a Unicode checkmark) then gets encoded as \u00e2\u009c\u0093, which, when interpreted by JavaScript and rendered in your browser, looks like â.

How can I fix this? Is there another encoding I can use?


echo json_encode(utf8_encode('✓')); // "\u00e2\u009c\u0093"

Now press F12 and paste that into your JavaScript console (quotes included). It should output â.
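Looking at the raw bytes may make it clearer what is going on. This is only a sketch: it assumes the mbstring extension is available, and it uses mb_convert_encoding from ISO-8859-1 as a stand-in for utf8_encode, which performs the same Latin-1 to UTF-8 conversion. The checkmark is written as explicit bytes so the result does not depend on how the file is saved.

```php
<?php
// The UTF-8 encoding of ✓ (U+2713) is the three bytes E2 9C 93.
$check = "\xE2\x9C\x93";
echo bin2hex($check), "\n";       // e29c93

// Treat each byte as a separate ISO-8859-1 character and re-encode it as UTF-8.
// Every one of the three bytes becomes its own two-byte character.
$double = mb_convert_encoding($check, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n";      // c3a2c29cc293

// json_encode now sees three characters: U+00E2, U+009C, U+0093.
echo json_encode($double), "\n";  // "\u00e2\u009c\u0093"
```

So the string was already valid UTF-8, and encoding it a second time turned one character into three.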


Please note that

echo json_encode('✓'); // "\u2713"

Works as intended. The issue is that sometimes the string will contain binary data which json_encode can't handle, so I need to sanitize every string without breaking the strings it can handle.


More examples:

json_encode(chr(200));              // false (bad)
json_encode(utf8_encode(chr(200))); // "\u00c8" (good)
json_encode('✓');                   // "\u2713" (good)
json_encode(utf8_encode('✓'));      // "\u00e2\u009c\u0093" (bad)

So you see, encoding it works well for some strings and breaks others.

This is strictly for logging. I don't care if the binary data comes out weird, I just don't want it to mess with valid strings.

Upvotes: 0

Views: 618

Answers (2)

mpen

Reputation: 282915

Running strings through this function

function _utf8($str) {
    if(!mb_check_encoding($str, 'UTF-8')) {
        return utf8_encode($str);
    }
    return $str;
}

(taken and modified from here)

Seems to give the results I'm after.

Checkmarks are left alone, but chr(200) and other weirdness is encoded:

json_encode(_utf8('✓'));      // "\u2713"
json_encode(_utf8(chr(200))); // "\u00c8"
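Since utf8_encode is deprecated as of PHP 8.2, the same idea could be sketched with mb_convert_encoding instead. The helper name to_valid_utf8 is made up for this example, and the sketch assumes the mbstring extension is loaded and that any string that is not valid UTF-8 can be treated as ISO-8859-1, which is what utf8_encode assumes as well.

```php
<?php
// Hypothetical helper name; same approach as _utf8() above.
function to_valid_utf8(string $str): string
{
    // Valid UTF-8 (including plain ASCII) passes through untouched.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    // Otherwise treat every byte as ISO-8859-1, as utf8_encode() would.
    return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

echo json_encode(to_valid_utf8("\xE2\x9C\x93")), "\n"; // "\u2713"
echo json_encode(to_valid_utf8(chr(200))), "\n";       // "\u00c8"
```

Note that binary data which happens to form valid UTF-8 is passed through unchanged, which is fine for logging but means the original bytes cannot always be told apart from text.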

Upvotes: 1

amphetamachine

Reputation: 30595

EDIT: This question is unanswerable. Encoding arbitrary binary data is one thing; keeping UTF-8 characters intact is something completely separate. What's to stop the byte sequence 0xe29c93 from being interpreted as ✓ when it shows up in your binary data?

According to the json_encode PHP reference page, you can use the following syntax to encode Unicode characters:

json_encode($data, JSON_UNESCAPED_UNICODE);

This should make json_encode pass Unicode characters through unescaped.
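For what it's worth, here is a rough sketch of how the flags behave. JSON_UNESCAPED_UNICODE only changes how valid characters are written, so it does not rescue invalid bytes on its own. If PHP 7.2 or later is available (an assumption about your setup), JSON_INVALID_UTF8_SUBSTITUTE replaces invalid sequences with U+FFFD instead of failing.

```php
<?php
$valid   = "\xE2\x9C\x93"; // UTF-8 bytes for the checkmark
$invalid = chr(200);       // lone byte 0xC8, not valid UTF-8

// Valid text is emitted literally instead of as a \u escape.
echo json_encode($valid, JSON_UNESCAPED_UNICODE), "\n";         // "✓"

// Invalid bytes still make the call fail.
var_dump(json_encode($invalid, JSON_UNESCAPED_UNICODE));        // bool(false)

// PHP 7.2+: invalid bytes become U+FFFD, valid text is left alone.
echo json_encode($invalid, JSON_INVALID_UTF8_SUBSTITUTE), "\n"; // "\ufffd"
echo json_encode($valid, JSON_INVALID_UTF8_SUBSTITUTE), "\n";   // "\u2713"
```

For logging, where the binary data is allowed to come out mangled, the substitute flag may be enough without any pre-processing of the strings.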

Upvotes: 0
