daniel

Converting Unicode code points to UTF-8

Currently I have something like this \u4eac\u90fd and I want to convert it to UTF-8 so I can insert it into a database.

Upvotes: 1

Views: 1765

Answers (3)

lm713

Reputation: 342

json_decode('"\u4eac\u90fd"');

Credit for the JSON approach goes to @bobince (https://stackoverflow.com/a/7107750), whose answer covers the reverse conversion (UTF-8 to code points). In that direction json_encode leaves ASCII characters as they are instead of escaping them, whereas json_decode does convert escaped ASCII code points to characters, e.g. '"\u0041"' -> 'A'.

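(Remember that the double quotes must be inside your string, so that json_decode receives a valid JSON string literal. Without them the input is not valid JSON and json_decode('\u4eac\u90fd'); returns null, which is why I was confused that it gave no output :-)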
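To avoid forgetting the inner quotes, the call can be wrapped in a small helper. This is only a sketch: the name decodeUnicodeEscapes is made up for illustration, and it assumes the input holds nothing but \uXXXX escapes and ordinary characters (no double quotes or stray backslashes of its own).

```php
<?php
// Hypothetical helper (not a built-in): wraps a string of \uXXXX escapes in
// double quotes so that json_decode() sees a valid JSON string literal.
// Assumes $escaped contains no double quotes or other backslash sequences.
function decodeUnicodeEscapes(string $escaped): ?string
{
    $decoded = json_decode('"' . $escaped . '"');

    // json_decode() returns null when the input is not valid JSON.
    return is_string($decoded) ? $decoded : null;
}

// Single quotes keep the backslashes literal in PHP source.
echo decodeUnicodeEscapes('\u4eac\u90fd'), "\n";

// Without the surrounding quotes the input is not valid JSON.
var_dump(json_decode('\u4eac\u90fd')); // NULL
```

The helper returns null for malformed input, so the caller can check the result before inserting anything into the database.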

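Note there are special requirements for 4-byte UTF-8 encodings, where the code point consists of 5 or 6 hexadecimal digits. A JSON \u escape takes exactly four hex digits and JSON has no curly-brace form such as \u{10348}, so a code point above U+FFFF has to be written as two \u escapes (a UTF-16 surrogate pair).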

echo json_encode('𐍈');
//output: "\ud800\udf48"

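𐍈 is U+10348. The separation into two escapes was not obvious to me: they are not two independent code points but the two halves of a UTF-16 surrogate pair, which json_decode joins back into one character. Please research this if you are dealing with 4-byte UTF-8 encodings (e.g. emoji).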
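The two escapes in that output are the UTF-16 surrogate pair for U+10348. Here is a minimal sketch of the arithmetic behind the split. The function names toSurrogatePair and fromSurrogatePair are made up for illustration, and both assume a valid code point above U+FFFF.

```php
<?php
// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 surrogate pair [high, low].
function toSurrogatePair(int $codePoint): array
{
    $offset = $codePoint - 0x10000;        // leaves a 20-bit value
    $high   = 0xD800 + ($offset >> 10);    // top 10 bits
    $low    = 0xDC00 + ($offset & 0x3FF);  // bottom 10 bits

    return [$high, $low];
}

// Hypothetical helper: join a surrogate pair back into one code point.
function fromSurrogatePair(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

[$high, $low] = toSurrogatePair(0x10348);
printf("\\u%04x\\u%04x\n", $high, $low); // \ud800\udf48

// json_decode() joins the pair into a single 4-byte UTF-8 sequence.
echo bin2hex(json_decode('"\ud800\udf48"')), "\n"; // f0908d88
```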

This is one of those frustrating cases where a standard purpose-made function should exist* but instead one has to use a workaround, and a search turns up many complicated user functions online.

*The function does exist in PHP 7 as IntlChar::chr (http://php.net/manual/en/intlchar.chr.php), but it requires the intl extension, which I do not believe is enabled by default.
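A rough sketch of how IntlChar::chr could be used with a fallback for systems where intl is missing. The wrapper name codePointToUtf8 is made up, and the manual branch is an ordinary hand-written UTF-8 encoder, not something from the PHP manual.

```php
<?php
// Hypothetical wrapper: prefer IntlChar::chr() when the intl extension is
// loaded, otherwise encode the code point to UTF-8 by hand.
function codePointToUtf8(int $codePoint): string
{
    // Reject values outside Unicode and the surrogate range U+D800..U+DFFF.
    if ($codePoint < 0 || $codePoint > 0x10FFFF
        || ($codePoint >= 0xD800 && $codePoint <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }

    if (class_exists('IntlChar')) {
        $char = IntlChar::chr($codePoint);
        if ($char !== null) {
            return $char;
        }
    }

    // Manual fallback: 1 to 4 bytes depending on the size of the code point.
    if ($codePoint < 0x80) {
        return chr($codePoint);
    }
    if ($codePoint < 0x800) {
        return chr(0xC0 | ($codePoint >> 6))
             . chr(0x80 | ($codePoint & 0x3F));
    }
    if ($codePoint < 0x10000) {
        return chr(0xE0 | ($codePoint >> 12))
             . chr(0x80 | (($codePoint >> 6) & 0x3F))
             . chr(0x80 | ($codePoint & 0x3F));
    }
    return chr(0xF0 | ($codePoint >> 18))
         . chr(0x80 | (($codePoint >> 12) & 0x3F))
         . chr(0x80 | (($codePoint >> 6) & 0x3F))
         . chr(0x80 | ($codePoint & 0x3F));
}

echo bin2hex(codePointToUtf8(0x4EAC)), "\n";  // e4baac
echo bin2hex(codePointToUtf8(0x10348)), "\n"; // f0908d88
```

Unlike the JSON route, this takes the code point as an integer, so no surrogate pairs are involved for characters above U+FFFF.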

Upvotes: 0

troelskn

Reputation: 117427

http://hsivonen.iki.fi/php-utf8/

Upvotes: 2

Martin v. Löwis

Reputation: 127447

Most likely, the \u escape sequence was already sent by the web browser. This would be the original source of your problem - you need to make the web browser stop doing that.

For that, you need to make sure that the browser knows what encoding to use when submitting the form. By default, the browser uses the encoding of the HTML page that contains the form. Make sure that this web page is encoded in UTF-8 and has a UTF-8 charset declaration in a meta header. With that done, the browser should submit UTF-8 data correctly, and you shouldn't need to convert anything at all.
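As an illustration only, a page that declares UTF-8 both in the HTTP header and in a meta element might look like the sketch below. The function name renderCityForm, the field name "city" and the target "save.php" are all made up, and the PHP file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical page builder: the form field and action are placeholders.
function renderCityForm(): string
{
    return <<<'HTML'
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>City</title>
</head>
<body>
  <form method="post" action="save.php" accept-charset="UTF-8">
    <input type="text" name="city">
    <button type="submit">Save</button>
  </form>
</body>
</html>
HTML;
}

// Declare the encoding in the HTTP header as well, before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderCityForm();
```

The accept-charset attribute is optional here, since the form would already inherit UTF-8 from the page, but it makes the intent explicit.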

Upvotes: 2
