Reputation: 6562
Let's say I have the following:
$str1 = "via Tokyo";
$str2 = "via 東京";
I want to match any non-whitespace characters after the "via ". Normally I'd use the following:
preg_match("/via\s(\S+)/", $str2, $match);
to obtain the matching characters. I assumed this wouldn't work with the above because preg_match doesn't understand UTF-8, but it works perfectly in this case.
Is this working correctly because preg_match is simply looking for bytes that aren't whitespace, and if so, am I safe to use this for any UTF-8 characters?
PS I'm aware that I should really be using the mb_ereg functions for this (or avoiding PHP altogether) but I'm looking for a better understanding of why this works. Thanks!
Upvotes: 0
Views: 198
Reputation: 324640
It's working because the individual bytes that make up 東 and 京 happen not to be whitespace characters in the single-byte character set. Among other things, your regex would happily accept U+2003 (em space) as part of the match, despite it being a whitespace character.
Try adding the u modifier to the end of the pattern to enable UTF-8 support.
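A minimal sketch of the difference, assuming a PHP build whose PCRE library has Unicode property support (in which case the u modifier also makes \s and \S Unicode-aware). The test string and variable names are made up for illustration; the em space is written as its three UTF-8 bytes.

```php
<?php
// "via 東京", then U+2003 EM SPACE (bytes E2 80 83), then "Station".
$str = "via 東京\xE2\x80\x83Station";

// Byte mode: \S matches any byte that is not ASCII whitespace,
// so the three bytes of the em space are swallowed into the match.
preg_match('/via\s(\S+)/', $str, $byteMatch);

// UTF-8 mode: the subject is decoded into code points, and the
// em space is recognised as whitespace, so the match stops before it.
preg_match('/via\s(\S+)/u', $str, $utf8Match);

var_dump($byteMatch[1]); // "東京", the em space, and "Station"
var_dump($utf8Match[1]); // "東京"
```

Note that with the u modifier preg_match returns false if the subject is not valid UTF-8, so the return value is worth checking on untrusted input.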
Upvotes: 0
Reputation: 109557
Yes. UTF-8 encodes every non-ASCII character as a multi-byte sequence, and it guarantees that each byte in such a sequence has its high bit set (value 0x80 or above), so none of them can be mistaken for an ASCII byte. Searching for a slash, backslash or space will therefore never produce a false positive inside a multi-byte sequence.
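As a quick check of that guarantee, this sketch dumps the bytes of the two characters from the question and confirms that all of them are 0x80 or above. The variable names are illustrative only.

```php
<?php
// unpack('C*', ...) returns the string's bytes as unsigned integers
// (the resulting array is 1-indexed).
$bytes = unpack('C*', "東京");

foreach ($bytes as $b) {
    printf("%02X ", $b); // prints: E6 9D B1 E4 BA AC
}
echo "\n";

// Every byte has the high bit set, so none falls in the ASCII range 0x00-0x7F.
$allHigh = min($bytes) >= 0x80;
var_dump($allHigh); // bool(true)
```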
Upvotes: 1