Reputation: 21513
There are several ways to check whether a string is valid UTF-8 in PHP, but has anyone actually benchmarked them to see which method is fastest?
The ways I know of (the list may well be incomplete):
function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}
function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}
function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}
// DO NOT USE is_utf8_4: it is buggy and incorrectly accepts "\xC0\x81" as valid.
//
// In 2009 the author claimed that this method is more accurate than
// mb_check_encoding, without giving any example where mb_check_encoding
// fails and this function succeeds...
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}
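For anyone wondering why "\xC0\x81" has to be rejected: it is a 2-byte sequence that decodes to a code point small enough to fit in a single byte (an "overlong" encoding), which UTF-8 forbids. A minimal sketch that decodes the two bytes by hand; the helper name decode_two_byte_sequence is made up for illustration and is not part of any of the functions above:

```php
<?php
// Hypothetical helper: decode a 2-byte sequence of the form 110xxxxx 10yyyyyy
// without validating it, to see which code point it would represent.
function decode_two_byte_sequence(string $bytes): int
{
    $lead = ord($bytes[0]);
    $cont = ord($bytes[1]);
    // Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return (($lead & 0x1F) << 6) | ($cont & 0x3F);
}

$codePoint = decode_two_byte_sequence("\xC0\x81");

// Anything below 0x80 must be encoded as a single byte, so a 2-byte form is overlong.
$isOverlong = $codePoint < 0x80;

var_dump($codePoint);                             // int(1)
var_dump($isOverlong);                            // bool(true)
var_dump((bool) preg_match('//u', "\xC0\x81"));   // bool(false)
```

The lead-byte check in is_utf8_4 only looks at ranges ($c > 191 means "2 bytes"), so 0xC0 and 0xC1 slip through even though no valid sequence can start with them.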
Upvotes: 2
Views: 748
Reputation: 924
has anyone actually benchmarked to check which method is faster?
I looked into this while implementing pure PHP msgpack serialization, and the fastest way I found to distinguish UTF-8 from non-UTF-8 strings was a specially crafted regex:
/\A(?:
[\x00-\x7F]++ # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*+\z/x
which can be up to 2x faster than //u. Here are some benchmark results I made on PHP 7.3: https://gist.github.com/rybakit/2c75152577fdcb9f4718d44e7123a539#file-output-txt.
Note, however, that pcre.jit must be enabled to achieve this, which is usually not a problem as it is enabled (set to 1) by default.
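In case it helps, here is a rough sketch of how the pattern above could be wrapped in a function. The name is_utf8_regex is my own placeholder, and treating a preg_match() error the same as "not valid" is an assumption on my part rather than something the benchmark gist prescribes:

```php
<?php
// Hypothetical wrapper name; the pattern itself is the one quoted above.
function is_utf8_regex(string $str): bool
{
    // Single-quoted on purpose, so PHP passes \xNN through to PCRE untouched.
    static $pattern = '/\A(?:
          [\x00-\x7F]++                      # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    // preg_match() returns 1 on a match, 0 on no match and false on error;
    // only 1 is treated as "valid" here.
    return preg_match($pattern, $str) === 1;
}

var_dump(is_utf8_regex('plain ASCII'));    // bool(true)
var_dump(is_utf8_regex("\xC3\xA9"));       // bool(true), U+00E9
var_dump(is_utf8_regex("\xC0\x81"));       // bool(false), overlong
var_dump(is_utf8_regex("\xED\xA0\x80"));   // bool(false), surrogate
```

Note that the pattern does not use the u modifier: it matches raw bytes, which is what lets it run on input that may not be valid UTF-8 in the first place.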
Upvotes: 2
Reputation: 21513
In this simple, non-comprehensive test, preg_match is over 32 times faster than mb_check_encoding on valid input, wow! What happened there? It's also 14 times faster than iconv, and 1344 times faster than the userland implementation.
Benchmarked on a dedicated server with an Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz and PHP 7.4.13. Running 1 million iterations per function and input yielded the following best (minimum) times, in nanoseconds as reported by hrtime(true):
root@x-ratma-net:~# time php bench2.php
Array
(
[is_utf8_1] => Array
(
[success] => 37835
[failure_early] => 37705
[failure_late] => 37632
)
[is_utf8_2] => Array
(
[success] => 1147
[failure_early] => 839
[failure_late] => 8521
)
[is_utf8_3] => Array
(
[success] => 16081
[failure_early] => 15667
[failure_late] => 15664
)
[is_utf8_4] => Array
(
[success] => 1542154
[failure_early] => 943
[failure_late] => 1542284
)
)
/root/bench2.php:91:
array(3) {
'success' =>
string(9) "is_utf8_2"
'failure_early' =>
string(9) "is_utf8_2"
'failure_late' =>
string(9) "is_utf8_2"
}
real 5m33.715s
user 5m33.364s
sys 0m0.292s
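The "N times faster" figures at the top of this answer come from the success row of the output above. A small sketch to reproduce the arithmetic; the numbers are copied from that output, and the variable names are just for illustration:

```php
<?php
// Best times in nanoseconds for the "success" input, copied from the output above.
$successNs = [
    'is_utf8_1' => 37835,    // mb_check_encoding
    'is_utf8_2' => 1147,     // preg_match('//u')
    'is_utf8_3' => 16081,    // iconv
    'is_utf8_4' => 1542154,  // userland loop
];

$baseline = $successNs['is_utf8_2'];
$ratios = [];
foreach ($successNs as $name => $ns) {
    // intdiv() truncates, which matches the "over N times" wording.
    $ratios[$name] = intdiv($ns, $baseline);
}

print_r($ratios);
```

The ratios only describe the valid-input case; for the failure_early and failure_late inputs the gaps are different, as the raw numbers show.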
benchmark code:
<?php
function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}
function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}
function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}
// DO NOT USE is_utf8_4: it is buggy and incorrectly accepts "\xC0\x81" as valid.
//
// In 2009 the author claimed that this method is more accurate than
// mb_check_encoding, without giving any example where mb_check_encoding
// fails and this function succeeds...
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}
$functions = [
    "is_utf8_1",
    "is_utf8_2",
    "is_utf8_3",
    "is_utf8_4",
];
$iterations = 1_000_000;
$results = [];
$test_strings = [];
$repeated = 10;
$test_strings["success"] = "ˈmaʳkʊs kuːn ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B), Σὲ γνωρίζω ἀπὸ τὴν κόψη Οὐχὶ ταὐτὰ παρίσταταί გთხოვთ ሰማይ አይታረስ ንጉሥ አይከሰስ ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ ";
$test_strings["success"] .= "♔♕♖♗♘♙♚♛♜♝♞🙾🙿";
$test_strings["success"] = str_repeat($test_strings["success"], $repeated);
$test_strings["failure_early"] = "\xFF\xFF\xFF\xFF" . $test_strings["success"];
$test_strings["failure_late"] = $test_strings["success"] . "\xFF\xFF\xFF\xFF";
// For every function and every test string, keep the fastest (minimum)
// single-call time in nanoseconds across all iterations.
foreach ($functions as $function) {
    foreach ($test_strings as $test_string_name => $test_string) {
        $best = PHP_FLOAT_MAX;
        for ($i = 0; $i < $iterations; ++$i) {
            $time = hrtime(true);
            $function($test_string);
            $time = hrtime(true) - $time;
            $best = min($time, $best);
        }
        $results[$function][$test_string_name] = $best;
    }
}
$winners = [];
foreach ($test_strings as $test_string_name => $_) {
    $best_function_name = "";
    $best_result = PHP_FLOAT_MAX;
    foreach ($results as $function_name => $function_results) {
        if ($best_result > $function_results[$test_string_name]) {
            $best_function_name = $function_name;
            $best_result = $function_results[$test_string_name];
        }
    }
    $winners[$test_string_name] = $best_function_name;
}
print_r($results);
var_dump($winners);
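A side note on why the is_utf8_2 trick works at all: with the u modifier, preg_match() checks that the subject is valid UTF-8 before matching, and on invalid input it returns false (not 0) with preg_last_error() set to PREG_BAD_UTF8_ERROR. The empty pattern matches any valid string, so the return value ends up reflecting only the validity check. A small sketch, with sample strings picked purely for illustration:

```php
<?php
$valid   = "gr\xC3\xBC\xC3\x9F";  // "grüß" written out as UTF-8 bytes
$invalid = "abc\xFF";             // 0xFF can never appear in UTF-8

$validResult = preg_match('//u', $valid);
$validError  = preg_last_error();

$invalidResult = preg_match('//u', $invalid);
$invalidError  = preg_last_error();

var_dump($validResult);                            // int(1)
var_dump($validError === PREG_NO_ERROR);           // bool(true)
var_dump($invalidResult);                          // bool(false)
var_dump($invalidError === PREG_BAD_UTF8_ERROR);   // bool(true)
```

The (bool) cast in is_utf8_2 is what turns the 1 / false pair into a clean true / false.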
Upvotes: 2