hanshenrik

Reputation: 21513

PHP fastest way to check if string is UTF-8?

There are several ways to check whether a string is valid UTF-8 in PHP, but has anyone actually benchmarked them to see which is fastest?

The ways I know of (I may be missing some):

function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}

function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}


// DO NOT USE is_utf8_4: it is buggy and incorrectly accepts "\xC0\x81".
//
// In 2009 the author claimed that this method is more accurate than
// mb_check_encoding, without giving any example where mb_check_encoding
// fails and this function succeeds.
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}
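
Since correctness matters before speed, here is a minimal sketch of an edge-case check that any of the candidates can be run through. The helper name utf8_failing_cases and the case labels are made up for illustration; the overlong case is the "\xC0\x81" input mentioned in the comment above, which is_utf8_4 is expected to get wrong.

```php
<?php

// Hypothetical helper: returns the names of the edge cases a validator gets wrong.
function utf8_failing_cases(callable $validator): array
{
    // name => [input bytes, whether the input is well-formed UTF-8]
    $cases = [
        'ascii'             => ["abc", true],
        'two_byte'          => ["\xC3\xA9", true],           // U+00E9
        'four_byte'         => ["\xF0\x9F\x99\xBE", true],   // U+1F67E
        'overlong'          => ["\xC0\x81", false],          // overlong 2-byte form
        'lone_continuation' => ["\x80", false],
        'truncated'         => ["\xE2\x82", false],          // 3-byte sequence cut short
        'surrogate'         => ["\xED\xA0\x80", false],      // U+D800
        'above_max'         => ["\xF4\x90\x80\x80", false],  // beyond U+10FFFF
    ];

    $failed = [];
    foreach ($cases as $name => [$bytes, $expected]) {
        if ($validator($bytes) !== $expected) {
            $failed[] = $name;
        }
    }
    return $failed;
}

// Usage with the preg_match-based check (same body as is_utf8_2).
$failed = utf8_failing_cases(function (string $s): bool {
    return (bool) preg_match('//u', $s);
});
print_r($failed);
```

An empty array means the validator agreed with every expectation in this small list; it is not a full conformance test.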

Upvotes: 2

Views: 748

Answers (2)

Eugene Leonovich

Reputation: 924

has anyone actually benchmarked to check which method is faster?

I looked into this while implementing a pure-PHP MessagePack serializer, and the fastest way I found to tell UTF-8 strings from non-UTF-8 strings was a specially crafted regex:

/\A(?:
      [\x00-\x7F]++                      # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*+\z/x

This can be up to 2x faster than //u. Here are some benchmark results I collected on PHP 7.3: https://gist.github.com/rybakit/2c75152577fdcb9f4718d44e7123a539#file-output-txt.

Note, however, that pcre.jit must be enabled to achieve this, which is usually not a problem as it is enabled (set to 1) by default.
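
For comparison with the functions in the question, here is a minimal sketch of how the pattern above might be wrapped. The names UTF8_PATTERN and is_utf8_regex are assumptions for illustration and are not taken from the linked gist. The pattern is matched bytewise (no /u modifier), so all of the validation is done by the alternatives themselves.

```php
<?php

// Hypothetical constant name; the pattern is the one quoted above.
// Single quotes keep the \xNN escapes intact so that PCRE interprets them.
const UTF8_PATTERN = '/\A(?:
      [\x00-\x7F]++                      # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*+\z/x';

function is_utf8_regex(string $str): bool
{
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    // This sketch treats an error the same as "not valid".
    return preg_match(UTF8_PATTERN, $str) === 1;
}

var_dump(is_utf8_regex("na\xC3\xAFve")); // contains a valid 2-byte sequence
var_dump(is_utf8_regex("\xC0\x81"));     // overlong form

// Whether the JIT is switched on for this PHP process.
var_dump((bool) ini_get('pcre.jit'));
```

Because a preg_match() error is reported as false here, a very long input that hits a PCRE limit would be classified as invalid, so checking preg_last_error() may be worthwhile in real code.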

Upvotes: 2

hanshenrik

Reputation: 21513

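In this simple, non-comprehensive test, preg_match is over 32 times faster than mb_check_encoding on valid input. Wow, what happened there? It is also 14 times faster than iconv and 1344 times faster than the userland implementation. The gap narrows when the invalid bytes sit at the end of the string: there preg_match takes 8521 against 37632 for mb_check_encoding.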

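Benchmarked on a dedicated server running an Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz with PHP 7.4.13.

Running 1 million iterations per function and test string yielded the following best-case times (in nanoseconds, as returned by hrtime(true)):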

root@x-ratma-net:~# time php bench2.php
Array
(
    [is_utf8_1] => Array
        (
            [success] => 37835
            [failure_early] => 37705
            [failure_late] => 37632
        )

    [is_utf8_2] => Array
        (
            [success] => 1147
            [failure_early] => 839
            [failure_late] => 8521
        )

    [is_utf8_3] => Array
        (
            [success] => 16081
            [failure_early] => 15667
            [failure_late] => 15664
        )

    [is_utf8_4] => Array
        (
            [success] => 1542154
            [failure_early] => 943
            [failure_late] => 1542284
        )

)
/root/bench2.php:91:
array(3) {
  'success' =>
  string(9) "is_utf8_2"
  'failure_early' =>
  string(9) "is_utf8_2"
  'failure_late' =>
  string(9) "is_utf8_2"
}

real    5m33.715s
user    5m33.364s
sys     0m0.292s
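
The ratios quoted above can be rechecked from the printed timings. The following is only a sketch of that arithmetic: the numbers are copied from the output above, and the $ratios layout is an arbitrary choice for illustration.

```php
<?php

// Best-case timings copied from the output above (nanoseconds from hrtime(true)).
$results = [
    'is_utf8_1' => ['success' => 37835,   'failure_early' => 37705, 'failure_late' => 37632],
    'is_utf8_2' => ['success' => 1147,    'failure_early' => 839,   'failure_late' => 8521],
    'is_utf8_3' => ['success' => 16081,   'failure_early' => 15667, 'failure_late' => 15664],
    'is_utf8_4' => ['success' => 1542154, 'failure_early' => 943,   'failure_late' => 1542284],
];

// For every test string, how many times slower each function is than is_utf8_2.
$ratios = [];
foreach ($results as $function => $timings) {
    foreach ($timings as $case => $ns) {
        $ratios[$function][$case] = round($ns / $results['is_utf8_2'][$case], 1);
    }
}
print_r($ratios);
```

Looking at all three columns rather than only the success column shows how much the result depends on where the first invalid byte appears.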

Benchmark code:

<?php


function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}

function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}


// DO NOT USE is_utf8_4: it is buggy and incorrectly accepts "\xC0\x81".
//
// In 2009 the author claimed that this method is more accurate than
// mb_check_encoding, without giving any example where mb_check_encoding
// fails and this function succeeds.
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}

$functions = [
    "is_utf8_1",
    "is_utf8_2",
    "is_utf8_3",
    "is_utf8_4",
];
$iterations = 1_000_000;
$results = [];
$test_strings = [];
$repeated = 10;
$test_strings["success"] = "ˈmaʳkʊs kuːn ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B), Σὲ γνωρίζω ἀπὸ τὴν κόψη Οὐχὶ ταὐτὰ παρίσταταί გთხოვთ ሰማይ አይታረስ ንጉሥ አይከሰስ ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ ";
$test_strings["success"] .= "♔♕♖♗♘♙♚♛♜♝♞🙾🙿";
$test_strings["success"] = str_repeat($test_strings["success"], $repeated);
$test_strings["failure_early"] = "\xFF\xFF\xFF\xFF" . $test_strings["success"];
$test_strings["failure_late"] = $test_strings["success"] . "\xFF\xFF\xFF\xFF";
foreach ($functions as $function) {
    foreach ($test_strings as $test_string_name => $test_string) {
        $best = PHP_FLOAT_MAX;
        for ($i = 0; $i < $iterations; ++$i) {
            $time = hrtime(true);
            $function($test_string);
            $time = hrtime(true) - $time;
            $best = min($time, $best);
        }
        $results[$function][$test_string_name] = $best;
    }
}
$winners = [];
foreach ($test_strings as $test_string_name => $_) {
    $best_function_name = "";
    $best_result = PHP_FLOAT_MAX;
    foreach ($results as $function_name => $function_results) {
        if ($best_result > $function_results[$test_string_name]) {
            $best_function_name = $function_name;
            $best_result = $function_results[$test_string_name];
        }
    }
    $winners[$test_string_name] = $best_function_name;
}
print_r($results);
var_dump($winners);

Upvotes: 2
