showkey

Reputation: 328

How to display the UTF-8 bytes and Unicode code point of a Chinese character with PHP?

In Python, I can get the UTF-8 bytes and the Unicode code point of a Chinese character.

python version 3.4
>>> print("你".encode("utf-8"))
b'\xe4\xbd\xa0'
>>> print("你".encode("unicode-escape"))
b'\\u4f60'

How do I display the UTF-8 bytes and Unicode code point of 你 (which means "you" in English) on a web page with PHP? How do I get the same output, '\xe4\xbd\xa0' and \\u4f60, in Firefox with PHP as I can with Python?

Upvotes: 1

Views: 2466

Answers (1)

David

Reputation: 383

The first example displays the UTF-8 encoded bytes. Assuming the string is UTF-8 encoded, you can simply print the hexadecimal value of each byte. Note that str_split splits the string into single bytes, not characters.

$str = "你";

foreach (str_split($str) as $byte) {
    echo '\\x'.str_pad(dechex(ord($byte)), 2, '0', STR_PAD_LEFT);
}

// prints: \xe4\xbd\xa0
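As an alternative to the loop above, the built-in bin2hex can produce all the hex digits in one call. This is only a minimal sketch, assuming the source file is saved as UTF-8; utf8_hex_escape is a hypothetical helper name, not something from the question or from PHP itself.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Returns each byte of $str as \xNN, using lowercase hex digits.
function utf8_hex_escape(string $str): string
{
    if ($str === '') {
        return '';
    }
    // bin2hex() yields two hex digits per byte, e.g. "e4bda0" for "你".
    // Splitting into pairs and joining with \x rebuilds the escaped form.
    return '\\x' . implode('\\x', str_split(bin2hex($str), 2));
}

echo utf8_hex_escape("你"), "\n"; // prints: \xe4\xbd\xa0
```

The result contains only backslashes, the letter x and hex digits, so it can be echoed into a web page as-is.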

The second example prints the Unicode code point of the character. Since PHP strings are sequences of bytes with no built-in notion of characters, we must first decode the bytes to get the code point, then format it as a hexadecimal number.

Based on the WHATWG Encoding Standard, we can write a UTF-8 decoder that handles all code points, including supplementary ones.

// Decodes a utf-8 encoded string and returns an array
// of code points or null if there was an error
// https://encoding.spec.whatwg.org/#utf-8-decoder
function decode_utf8($str)
{
    $code_point = 0;
    $bytes_needed = 0;
    $bytes_seen = 0;

    $lower_boundary = 0x80;
    $upper_boundary = 0xbf;

    $code_points = array();

    for ($i = 0, $len = strlen($str); $i < $len; $i++) {
        $byte = ord($str[$i]);

        if ($bytes_needed == 0) {
            if ($byte >= 0x00 and $byte <= 0x7f) {
                $code_points[] = $byte;
            } elseif ($byte >= 0xc2 and $byte <= 0xdf) {
                $bytes_needed = 1;
                $code_point = $byte - 0xc0;
            } elseif ($byte >= 0xe0 and $byte <= 0xef) {
                if ($byte == 0xe0) {
                    $lower_boundary = 0xa0;
                }
                if ($byte == 0xed) {
                    $upper_boundary = 0x9f;
                }

                $bytes_needed = 2;
                $code_point = $byte - 0xe0;
            } elseif ($byte >= 0xf0 and $byte <= 0xf4) {
                if ($byte == 0xf0) {
                    $lower_boundary = 0x90;
                }
                if ($byte == 0xf4) {
                    $upper_boundary = 0x8f;
                }

                $bytes_needed = 3;
                $code_point = $byte - 0xf0;
            } else {
                return;
            }

            $code_point = $code_point << (6 * $bytes_needed);
            continue;
        }

        if ($byte < $lower_boundary or $byte > $upper_boundary) {
            return;
        }

        $lower_boundary = 0x80;
        $upper_boundary = 0xbf;

        $bytes_seen++;
        $code_point += ($byte - 0x80) << (6 * ($bytes_needed - $bytes_seen));

        if ($bytes_seen != $bytes_needed) {
            continue;
        }

        $code_points[] = $code_point;

        $code_point = 0;
        $bytes_needed = 0;
        $bytes_seen = 0;
    }

    if ($bytes_needed != 0) {
        return;
    }

    return $code_points;
}
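To make the bit arithmetic in the decoder concrete, here is a worked example for the three bytes of 你 (e4 bd a0). It is a standalone sketch of the same subtract-and-shift steps, not a replacement for the function; the variable names are chosen here for illustration.

```php
<?php
// Worked example for the three UTF-8 bytes of "你": e4 bd a0.
$lead  = 0xe4 - 0xe0; // lead byte of a 3-byte sequence carries 4 payload bits: 0x4
$cont1 = 0xbd - 0x80; // each continuation byte carries 6 payload bits: 0x3d
$cont2 = 0xa0 - 0x80; // 0x20

// The lead bits are shifted past both continuation bytes (2 * 6 bits),
// the first continuation byte past one (6 bits), the last by none.
$code_point = ($lead << 12) + ($cont1 << 6) + $cont2;

echo dechex($code_point), "\n"; // prints: 4f60
```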

Once we have the code points, we convert them to hexadecimal with dechex. Then, using str_pad, we left-pad them with zeros. If the code point is in the Basic Multilingual Plane, we pad it to four digits; otherwise we pad it to six. Finally, we prepend \u.

$str = '你';

foreach (decode_utf8($str) as $code_point) {
    echo '\\u'.str_pad(dechex($code_point), $code_point>0xffff?6:4, '0', STR_PAD_LEFT);
}
// prints: \u4f60

It also works for characters outside the Basic Multilingual Plane, such as the CJK ideograph extensions.

$str = '𠀀'; // U+20000

foreach (decode_utf8($str) as $code_point) {
    echo '\\u'.str_pad(dechex($code_point), $code_point>0xffff?6:4, '0', STR_PAD_LEFT);
}
// prints: \u020000
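If the mbstring extension is available, a shorter route may be possible with mb_str_split and mb_ord. The sketch below assumes PHP 7.4 or later and valid UTF-8 input; unicode_escape is a hypothetical helper name. For code points above U+FFFF it uses \U followed by eight hex digits, which is the form Python's unicode-escape codec produces for such characters. Unlike Python, this sketch escapes every character, including plain ASCII.

```php
<?php
// Sketch assuming PHP 7.4+ with the mbstring extension and valid UTF-8 input.
// unicode_escape() is a hypothetical helper name, used for illustration only.
function unicode_escape(string $str): string
{
    $out = '';
    // mb_str_split() splits by character rather than by byte.
    foreach (mb_str_split($str, 1, 'UTF-8') as $char) {
        $code_point = mb_ord($char, 'UTF-8');
        // BMP code points get \u plus 4 hex digits; anything above
        // U+FFFF gets \U plus 8 hex digits.
        $out .= $code_point > 0xffff
            ? sprintf('\\U%08x', $code_point)
            : sprintf('\\u%04x', $code_point);
    }
    return $out;
}

echo unicode_escape('你'), "\n";  // prints: \u4f60
echo unicode_escape('𠀀'), "\n"; // prints: \U00020000
```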

Upvotes: 2
