ahmd0

Reputation: 17303

Parse UTF-8 string char-by-char in PHP

I'm sorry if I'm asking the obvious, but I can't seem to find a working solution for a simple task. As input I have a user-provided string encoded as UTF-8. I need to sanitize it by removing all characters less than 0x20 (space), except 0x9 (tab).

The following works for ANSI strings, but not for UTF-8:

$newName = "";
$ln = strlen($name);
for($i = 0; $i < $ln; $i++)
{
    $ch = substr($name, $i, 1);
    $och = ord($ch);
    if($och >= 0x20 ||
        $och == 0x9)
    {
        $newName .= $ch;
    }
}

It completely misses UTF-8 encoded characters and treats them as bytes. I keep finding posts where people suggest using the mb_ functions, but that still doesn't help me. (For instance, I tried calling mb_strlen($name, "utf-8"); instead of strlen, but it still returns the length of the string in bytes instead of characters.)

Any idea how to do this in PHP?

PS. Sorry, my PHP is somewhat rusty.

Upvotes: 1

Views: 453

Answers (2)

ahmd0

Reputation: 17303

Wow, PHP is one messed up language. Here's what worked for me (but how much slower will this run for a longer chunk of text...):

function normalizeName($name, $encoding_2_use, $encoding_used)
{
    //'$name' = string to normalize
    //          INFO: Must be encoded with '$encoding_used' encoding
    //'$encoding_2_use' = encoding to use for return string (example: "utf-8")
    //'$encoding_used' = encoding used to encode '$name' (can be also "utf-8")
    //RETURN:
    //      = Name normalized, or
    //      = "" if error
    $resName = "";

    $ln = mb_strlen($name, $encoding_used);
    if($ln !== false)
    {
        for($i = 0; $i < $ln; $i++)
        {
            $ch = mb_substr($name, $i, 1, $encoding_used);

            $arp = unpack('N', mb_convert_encoding($ch, 'UCS-4BE', $encoding_used));
            if(count($arp) >= 1)
            {
                $och = intval($arp[1]);    //Index 1 because unpack() numbers unnamed format elements starting at 1, not 0
                if($och >= 0x20 || $och == 0x9)
                {
                    $ch2 = mb_convert_encoding('&#'.$och.';', $encoding_2_use, 'HTML-ENTITIES');
                    $resName .= $ch2;
                }
            }
        }
    }

    return $resName;
}
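
For longer text, a single regular-expression pass may avoid the per-character loop. The following is only a minimal sketch, assuming the input is valid UTF-8; the function name stripControlCharsRegex is hypothetical and not part of the code above. It relies on the fact that in UTF-8 every byte of a multibyte sequence is 0x80 or higher, so the control characters below 0x20 can be matched without touching other characters.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Removes U+0000..U+001F except U+0009 (tab) from a UTF-8 string.
function stripControlCharsRegex(string $name): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // The class covers 0x00-0x08 and 0x0A-0x1F, leaving 0x09 (tab) alone.
    $clean = preg_replace('/[\x00-\x08\x0A-\x1F]/u', '', $name);

    // preg_replace() returns null on error, e.g. when the input is not valid UTF-8.
    return $clean ?? '';
}
```

Returning "" on error mirrors the convention used by normalizeName() above; whether that is the right behavior for invalid input is a design choice.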

Upvotes: 0

Sverri M. Olsen

Reputation: 13283

If you use multibyte functions (mb_) then you have to use them for everything. In this example you should use mb_strlen() and mb_substr().

The reason it is not working is probably that you are using ord(), which only looks at a single byte. The manual describes it as follows:

ord
(PHP 4, PHP 5)
ord — Return ASCII value of character
...
Returns the ASCII value of the first character of string.

In other words, if you throw a multibyte character into ord() it will only use the first byte, and throw away the rest.
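
The advice above can be sketched as a character-by-character loop that never falls back to byte-oriented functions. This is an illustration under assumptions, not a definitive implementation: it assumes PHP 7.2 or later with the mbstring extension (for mb_ord()) and UTF-8 input, and the function name stripControlChars is hypothetical.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Keeps every character whose code point is >= 0x20, plus tab (0x9).
function stripControlChars(string $name): string
{
    // Split into whole characters; /u makes PCRE treat the subject as UTF-8.
    $chars = preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }

    $result = '';
    foreach ($chars as $ch) {
        // mb_ord() returns the full code point, unlike ord(),
        // which only returns the value of the first byte.
        $cp = mb_ord($ch, 'UTF-8');
        if ($cp !== false && ($cp >= 0x20 || $cp === 0x9)) {
            $result .= $ch;
        }
    }
    return $result;
}
```

For example, stripControlChars("a\x01b\tc") yields "ab\tc": the 0x01 byte is dropped and the tab is kept.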

Upvotes: 1
