Reputation: 1516
Is there a function that converts a UTF-8 string to Unicode escape codes while leaving non-special characters (ordinary letters and numbers) unchanged?
I.e. the German word "tchüß" would be rendered as something like "tch\20AC\21AC" (please note that I am making the Unicode codes up).
EDIT: I am experimenting with the following function. It works well for ASCII 32-127, but it seems to fail for multi-byte characters:
function strToHex ($string)
{
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
    {
        $id = ord (mb_substr ($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
    }
    return ($hex);
}
Any ideas?
EDIT 2: Found a solution: the PHP ord() function does not work for multi-byte characters. Use this instead: http://nl.php.net/manual/en/function.ord.php#78032
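To illustrate why ord() fails here: it reads only the first byte of its argument, so for a multi-byte UTF-8 character it returns the lead byte rather than the code point. The following is a minimal sketch, assuming PHP 7.2+ with the mbstring extension (for mb_ord()) and a source file saved as UTF-8. strToEntities() is a hypothetical name used for illustration; it is not the function from the linked manual comment.

```php
<?php
// ord() looks at the first byte only; "ü" is the two bytes 0xC3 0xBC in UTF-8.
var_dump(ord('ü'));              // int(195), the lead byte 0xC3
var_dump(mb_ord('ü', 'UTF-8'));  // int(252), the code point U+00FC

// Hypothetical rewrite of strToHex(): keep ASCII, emit numeric entities otherwise.
function strToEntities(string $string): string
{
    $out = '';
    $length = mb_strlen($string, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $char = mb_substr($string, $i, 1, 'UTF-8');
        $codePoint = mb_ord($char, 'UTF-8');
        $out .= ($codePoint < 128) ? $char : '&#' . $codePoint . ';';
    }
    return $out;
}

echo strToEntities('tchüß'), "\n"; // tch&#252;&#223;
```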
Upvotes: 6
Views: 39555
Reputation: 2889
I see many people confusing the encodings (UTF-8, UTF-16, ...) with Unicode itself. Unicode is the complete set of characters, numbered from 0 up to values far beyond the ASCII limit of 127; those numbers can be ENCODED with UTF-8 or UTF-16 so that they can be stored and displayed as characters.
Note that values from 128 to 255 can also be represented using Latin encodings such as ISO-8859-1 and windows-1252. The UTF-8 encoding has the advantage that it covers the whole range from 0 to 0x10FFFF.
Here is a concrete example to pin these concepts down:
The numeric value (integer/long) 129429
is represented in hexadecimal as 0x1f995
This Unicode value (or "code point", which renders a dinosaur) can be encoded as:
\uD83E\uDD95
using UTF-16
or
0xf09fa695
using UTF-8
Both will render the original dinosaur (0x1f995)
So... Unicode is NOT UTF-8, nor is it UTF-16.
These are two "encoding schemes" for Unicode code points/values/characters.
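The example above can be reproduced in PHP. This is only a sketch, assuming the mbstring extension is available; it builds the code point as a single UTF-32BE unit with pack() and lets mb_convert_encoding() produce the two encodings.

```php
<?php
// U+1F995 (decimal 129429) as one 32-bit big-endian code unit, i.e. UTF-32BE.
$codePoint = 0x1F995;
$utf32 = pack('N', $codePoint);

// Re-encode the same code point two ways and show the resulting bytes.
$utf8  = mb_convert_encoding($utf32, 'UTF-8', 'UTF-32BE');
$utf16 = mb_convert_encoding($utf32, 'UTF-16BE', 'UTF-32BE');

echo bin2hex($utf8), "\n";   // f09fa695
echo bin2hex($utf16), "\n";  // d83edd95

// Decoding either byte sequence gives back the same code point.
$fromUtf8  = unpack('N', mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8'))[1];
$fromUtf16 = unpack('N', mb_convert_encoding($utf16, 'UTF-32BE', 'UTF-16BE'))[1];
var_dump($fromUtf8 === $codePoint && $fromUtf16 === $codePoint); // bool(true)
```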
Upvotes: 1
Reputation: 159
Tested on php 5.6
/**
 * @param string $utf8char
 * @return string
 */
function toUnicodeCodePoint($utf8char)
{
    return 'U+' . dechex(mb_ord2($utf8char));
}
/**
 * @see https://github.com/symfony/polyfill-mbstring
 * @param string $s
 * @return int
 */
function mb_ord2($s)
{
    // $s becomes a 1-indexed array of byte values; $code is the lead byte
    $code = ($s = unpack('C*', substr($s, 0, 4))) ? $s[1] : 0;
    // lead byte 0xF0 or higher: 4-byte sequence
    if (0xF0 <= $code) {
        return (($code - 0xF0) << 18) + (($s[2] - 0x80) << 12) + (($s[3] - 0x80) << 6) + $s[4] - 0x80;
    }
    // lead byte 0xE0 or higher: 3-byte sequence
    if (0xE0 <= $code) {
        return (($code - 0xE0) << 12) + (($s[2] - 0x80) << 6) + $s[3] - 0x80;
    }
    // lead byte 0xC0 or higher: 2-byte sequence
    if (0xC0 <= $code) {
        return (($code - 0xC0) << 6) + $s[2] - 0x80;
    }
    // single-byte (ASCII) character
    return $code;
}
echo toUnicodeCodePoint('😓');
// U+1f613
Upvotes: 3
Reputation:
For people looking to find the Unicode code point of any character, this might be useful. You can then encode the string however you want, replacing certain characters with escape codes and leaving others in their binary form (e.g. ASCII printable characters), depending on the context in which you want to use it.
From: Mapping codepoints to Unicode encoding forms
The mapping for UTF-32 is, essentially, the identity mapping: the 32-bit code unit used to encode a codepoint has the same integer value as the codepoint itself.
/**
 * Convert a string into an array of decimal Unicode code points.
 *
 * @param $string   [string] The string to convert to codepoints
 * @param $encoding [string] The encoding of $string
 *
 * @return [array] Array of decimal codepoints for every character of $string
 */
function toCodePoint( $string, $encoding )
{
    $utf32  = mb_convert_encoding( $string, 'UTF-32', $encoding );
    $length = mb_strlen( $utf32, 'UTF-32' );
    $result = [];
    for( $i = 0; $i < $length; ++$i )
        $result[] = hexdec( bin2hex( mb_substr( $utf32, $i, 1, 'UTF-32' ) ) );
    return $result;
}
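Because each UTF-32 code unit is the code point itself, the per-character loop can also be replaced by a single unpack() call. This is a sketch of that variation, not part of the answer above; codePointsViaUnpack() is a hypothetical name, and it assumes mbstring and a UTF-8 source file.

```php
<?php
function codePointsViaUnpack(string $string, string $encoding = 'UTF-8'): array
{
    // UTF-32BE: every character becomes exactly four big-endian bytes.
    $utf32 = mb_convert_encoding($string, 'UTF-32BE', $encoding);
    // 'N*' reads consecutive unsigned 32-bit big-endian integers;
    // unpack() keys start at 1, so re-index from 0.
    return array_values(unpack('N*', $utf32));
}

print_r(codePointsViaUnpack('tchüß')); // 116, 99, 104, 252, 223
```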
Upvotes: 10
Reputation: 1852
With PHP 7, there is a new IntlChar::ord() to find the Unicode Code Point from a given UTF-8 character:
var_dump(sprintf('U+%04X', IntlChar::ord('ß')));
# Outputs: string(6) "U+00DF"
Upvotes: 11
Reputation: 188
I had a problem where I needed to partly convert a string (UTF-8 by default) containing Cyrillic to escape codes: only the Cyrillic characters. In the end I needed a JSON-like result, turning this:
<li class="my_class">City - Moscow (Москва)</li>
into this:
<li class=\"my_class\">City - Moscow (\u041c\u043e\u0441\u043a\u0432\u0430)<\/li>
So I put together a composite solution (a mix of the question author's and Nus's code):
function strToHex($string){
    $enc = "utf-8";
    $hex = '';
    for ($i = 0; $i < mb_strlen($string, $enc); $i++){
        $char = mb_substr($string, $i, 1, $enc);
        // ord() sees only the first byte, which is enough to tell ASCII from non-ASCII
        $id = ord($char);
        $hex .= ($id <= 128) ? $char : toCodePoint($char, $enc);
    }
    return $hex;
}
function toCodePoint($string, $encoding){
    $utf32 = mb_convert_encoding($string, 'UTF-32', $encoding);
    $length = mb_strlen($utf32, 'UTF-32');
    $result = Array();
    for ($i = 0; $i < $length; ++$i) {
        // keeps only the last 4 hex digits of the UTF-32 value,
        // so characters above U+FFFF are not handled correctly
        $result[] = "\u" . substr(bin2hex(mb_substr($utf32, $i, 1, 'UTF-32')), 4, 8);
    }
    return implode("", $result);
}
$output = strToHex(
    str_replace( // this is for JSON compatibility
        array("\"", "\n", "\r", "\t", "/"),
        array('\"', '\n', "", " ", "\/"),
        $text
    )
);
echo $output;
Tested on PHP 5.2.17 :)
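For comparison, a similar result can probably be had from json_encode() alone by stripping the surrounding quotes it adds. This is a sketch under assumptions, not a drop-in replacement: escapeForJsonBody() is a hypothetical name, the input must be valid UTF-8, and unlike the code above it keeps "\r" and "\t" as escapes instead of dropping or replacing them.

```php
<?php
// Hypothetical helper: JSON-escape a UTF-8 string and drop the outer quotes.
function escapeForJsonBody(string $text): string
{
    // json_encode() escapes ", / and all non-ASCII characters by default.
    return substr(json_encode($text), 1, -1);
}

echo escapeForJsonBody('<li class="x">Москва</li>'), "\n";
// <li class=\"x\">\u041c\u043e\u0441\u043a\u0432\u0430<\/li>
```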
Upvotes: 0
Reputation: 3034
Converting one character set to another can be done with iconv:
http://php.net/manual/en/function.iconv.php
Note that UTF-8 is already a Unicode encoding.
Another way is simply using htmlentities with the right character set:
http://php.net/manual/en/function.htmlentities.php
Upvotes: 3
Reputation: 536339
For a readable form I would go with JSON. JSON does not require non-ASCII characters to be escaped, but PHP escapes them by default:
echo json_encode("tchüß");
"tch\u00fc\u00df"
Upvotes: 28
Reputation: 41040
I once created a function called _convert() which encodes safely everything to UTF-8.
Upvotes: 2
Reputation: 49
I guess you're going to print out your strings on a website?
I'm storing all my databases in utf8 and calling htmlentities($string) before output.
Maybe you have to try htmlentities(utf8_encode($string));
Upvotes: 2