Graham

Reputation: 6562

preg_match with UTF8

Let's say I have the following:

$str1 = "via Tokyo";
$str2 = "via 東京";

I want to match any non-whitespace characters after the "via ". Normally I'd use the following:

preg_match("/via\s(\S+)/", $str2, $match);

to obtain the matching characters. I assumed this wouldn't work with the above because preg_match doesn't understand UTF-8; however, it works perfectly in this case.

Is this working correctly because preg_match is simply looking for bytes that aren't whitespace, and if so, am I safe to use this for any UTF-8 characters?

PS I'm aware that I should really be using the mb_ereg functions for this (or avoiding PHP altogether) but I'm looking for a better understanding of why this works. Thanks!

Upvotes: 0

Views: 198

Answers (2)

Niet the Dark Absol

Reputation: 324640

It's working because the individual bytes that make up 東京 happen not to be whitespace characters in the single-byte character set. Among other things, your regex would happily accept an em space (U+2003) despite it being a whitespace character.

Try adding the u modifier to the end of the pattern, i.e. /via\s(\S+)/u, to enable UTF-8 support.
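To make the difference concrete, here is a minimal sketch comparing the question's pattern with and without the u modifier. The input string is made up for illustration, and the result assumes a PHP build where the u modifier also turns on Unicode character properties for \s and \S (which is my understanding of current PHP, not something stated in this thread).

```php
<?php
// Hypothetical input: "via Tokyo", then an EM SPACE (U+2003, bytes E2 80 83), then "Station".
$str = "via Tokyo\xE2\x80\x83Station";

// Byte mode: E2, 80 and 83 are not ASCII whitespace, so \S+ runs straight through them.
preg_match('/via\s(\S+)/', $str, $byteMatch);

// UTF-8 mode: the three bytes are decoded as one whitespace character, so \S+ stops before it.
preg_match('/via\s(\S+)/u', $str, $utf8Match);

var_dump($byteMatch[1] === "Tokyo\xE2\x80\x83Station");
var_dump($utf8Match[1] === "Tokyo");
```

Both comparisons should come out true if the assumption about the u modifier holds on your build.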

Upvotes: 0

Joop Eggen

Reputation: 109557

Yes. UTF-8 encodes non-ASCII characters as multi-byte sequences, and it guarantees that every byte of such a sequence differs from the ASCII bytes by having its high bit set. So searching for a slash, backslash or space will never give a false positive inside a multi-byte sequence.
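The high-bit guarantee can be checked directly on the bytes of 東京. This is only a sketch; the variable names are illustrative, and the string is spelled out as escaped bytes so the result does not depend on the encoding of the source file.

```php
<?php
// "東京" in UTF-8: 東 = E6 9D B1, 京 = E4 BA AC.
$word = "\xE6\x9D\xB1\xE4\xBA\xAC";

$allHigh = true;
foreach (unpack('C*', $word) as $byte) {
    // Every byte of a multi-byte UTF-8 sequence lies in 0x80..0xFF (high bit set).
    if ($byte < 0x80) {
        $allHigh = false;
    }
}

// No byte equals the ASCII space (0x20), so a byte-wise search cannot hit one by accident.
$hasSpaceByte = strpos($word, ' ') !== false;

var_dump($allHigh);       // bool(true)
var_dump($hasSpaceByte);  // bool(false)
```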

Upvotes: 1
