Reputation: 139892
For example, how do I get the character corresponding to U+010F?
Upvotes: 16
Views: 18416
Reputation: 40683
In case this is useful to anyone, PHP 7.2 added mb_ord and mb_chr, the multibyte equivalents of ord and chr. For example, the following code works in PHP 7.2:
$charPoint = "U+010F";
echo mb_chr(hexdec(substr($charPoint, 2))); // prints ď
This has two consequences: (a) you no longer need to implement your own versions, and (b) if you have already implemented your own, you need to wrap them in if (!function_exists('mb_chr')) so they do not clash with the built-in functions.
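A minimal sketch of how the "U+XXXX" notation could be handled on both old and new PHP versions. The helper name codepoint_to_char is made up for this example, and the fallback through html_entity_decode is just one possible choice for versions without mb_chr.

```php
<?php
// Hypothetical helper (not part of PHP): turns "U+010F" style notation
// into the corresponding UTF-8 character, or false if the input is malformed.
function codepoint_to_char($notation)
{
    // Accept an optional "U+" prefix followed by 1 to 6 hex digits.
    if (!preg_match('/^(?:U\+)?([0-9A-Fa-f]{1,6})$/', $notation, $m)) {
        return false;
    }
    $cp = hexdec($m[1]);

    if (function_exists('mb_chr')) {
        // PHP 7.2+ with mbstring loaded: use the built-in.
        return mb_chr($cp, 'UTF-8');
    }

    // Older PHP: decode a numeric character reference instead.
    return html_entity_decode('&#' . $cp . ';', ENT_QUOTES, 'UTF-8');
}

echo codepoint_to_char('U+010F'), "\n";
```

Because the function_exists check happens at call time, the same file can be used unchanged before and after an upgrade to PHP 7.2.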
Upvotes: 3
Reputation: 213
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
echo chr_utf8(hexdec('010F'));
// Output the UTF-8 character corresponding to U+010F
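For readers who find the one-liner above hard to follow, here is a longer sketch of the same bit manipulation, one branch per UTF-8 sequence length. The function name utf8_encode_codepoint is made up for this example; unlike the compact version, this sketch also rejects surrogates and returns false rather than an empty string for invalid input.

```php
<?php
// Illustrative helper (name is hypothetical): encodes one Unicode code point
// as a UTF-8 byte string using only core PHP functions.
function utf8_encode_codepoint($cp)
{
    // Reject values outside Unicode and the UTF-16 surrogate range.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return false;
    }
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8_encode_codepoint(0x010F)), "\n"; // c48f
```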
Upvotes: 0
Reputation: 4640
IntlChar is an ICU-based built-in class, introduced in PHP 7 as part of the intl extension, that addresses exactly this problem:
IntlChar provides access to a number of utility methods that can be used to access information about Unicode characters.
// PHP 7.0 and later
var_dump(
"\u{010F}" === IntlChar::chr(0x010F),
0x010F === IntlChar::ord("\u{010F}")
);
// PHP 7.2 and later
var_dump(
"\u{010F}" === mb_chr(0x010F, "UTF-8"),
0x010F === mb_ord("\u{010F}", "UTF-8")
);
Upvotes: 2
Reputation: 47101
I just wrote a polyfill for the missing multibyte versions of ord and chr, with the following in mind:
It defines the functions mb_ord and mb_chr only if they don't already exist. If they do exist in your framework or in some future version of PHP, the polyfill will be ignored.
It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it will use the iconv extension instead.
EDIT:
I added functions for encoding to and decoding from HTML entities and JSON, as well as some demo code showing how to use these functions:
if (!function_exists('codepoint_encode')) {
function codepoint_encode($str) {
return substr(json_encode($str), 1, -1);
}
}
if (!function_exists('codepoint_decode')) {
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
}
if (!function_exists('mb_internal_encoding')) {
function mb_internal_encoding($encoding = NULL) {
return ($encoding === NULL) ? iconv_get_encoding('internal_encoding') : iconv_set_encoding('internal_encoding', $encoding);
}
}
if (!function_exists('mb_convert_encoding')) {
function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
}
}
if (!function_exists('mb_chr')) {
function mb_chr($ord, $encoding = 'UTF-8') {
if ($encoding === 'UCS-4BE') {
return pack("N", $ord);
} else {
return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
}
}
}
if (!function_exists('mb_ord')) {
function mb_ord($char, $encoding = 'UTF-8') {
if ($encoding === 'UCS-4BE') {
list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
return $ord;
} else {
return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
}
}
}
if (!function_exists('mb_htmlentities')) {
function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
}, $string);
}
}
if (!function_exists('mb_html_entity_decode')) {
function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
}
}
echo "Get string from numeric DEC value\n";
var_dump(mb_chr(50319, 'UCS-4BE'));
var_dump(mb_chr(271));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0xC48F, 'UCS-4BE'));
var_dump(mb_chr(0x010F));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('ď', 'UCS-4BE'));
var_dump(mb_ord('ď'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('ď', 'UCS-4BE')));
var_dump(dechex(mb_ord('ď')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('tchüß', false));
var_dump(mb_html_entity_decode('tch&#252;&#223;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('tchüß'));
var_dump(mb_html_entity_decode('tch&#xFC;&#xDF;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("tchüß"));
var_dump(codepoint_decode('tch\u00fc\u00df'));
Get string from numeric DEC value
string(4) "ď"
string(2) "ď"
Get string from numeric HEX value
string(4) "ď"
string(2) "ď"
Get numeric value of character as DEC int
int(50319)
int(271)
Get numeric value of character as HEX string
string(4) "c48f"
string(3) "10f"
Encode / decode to DEC based HTML entities
string(15) "tch&#252;&#223;"
string(7) "tchüß"
Encode / decode to HEX based HTML entities
string(15) "tch&#xFC;&#xDF;"
string(7) "tchüß"
Use JSON encoding / decoding
string(15) "tch\u00fc\u00df"
string(7) "tchüß"
Upvotes: 5
Reputation: 13930
If you control the encoding of your strings and keep them in UTF-8 (which is generally recommended), you only need:
html_entity_decode($string, ENT_COMPAT, 'UTF-8');
See Example #1 in the PHP manual entry for html_entity_decode. You can change the second parameter to ENT_NOQUOTES, etc., and pay attention: add ENT_XHTML or a similar flag if your string is in a markup language.
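As a rough illustration of how this applies to the original question, the code point can first be written as a numeric character reference and then decoded. The variable names here are only for the example.

```php
<?php
// Build a numeric character reference from the code point, then decode it.
$codepoint = 0x010F;
$entity = sprintf('&#x%X;', $codepoint);                  // "&#x10F;"
$char = html_entity_decode($entity, ENT_COMPAT, 'UTF-8');

var_dump($entity);
var_dump(bin2hex($char)); // the UTF-8 bytes of U+010F are c4 8f
```

This relies only on core PHP functions, so it does not need the mbstring or intl extensions.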
Upvotes: 0
Reputation: 154573
header('Content-Type: text/html; charset=UTF-8');
function mb_html_entity_decode($string)
{
if (extension_loaded('mbstring') === true)
{
mb_language('Neutral');
mb_internal_encoding('UTF-8');
mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));
return mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES');
}
return html_entity_decode($string, ENT_COMPAT, 'UTF-8');
}
function mb_ord($string)
{
if (extension_loaded('mbstring') === true)
{
mb_language('Neutral');
mb_internal_encoding('UTF-8');
mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));
$result = unpack('N', mb_convert_encoding($string, 'UCS-4BE', 'UTF-8'));
if (is_array($result) === true)
{
return $result[1];
}
}
return ord($string);
}
function mb_chr($string)
{
return mb_html_entity_decode('&#' . intval($string) . ';');
}
var_dump(hexdec('010F'));
var_dump(mb_ord('ó')); // 243
var_dump(mb_chr(243)); // ó
Upvotes: 25