Linus Kleen

Reputation: 34632

Non-ASCII characters in UTF-8 mode regular expression

Question

Despite the PHP manual stating:

"In UTF-8 mode, characters with values greater than 128 do not match any of the POSIX character classes."

Why do Persian digits match \d or [[:digit:]] in "UTF-8 mode"?

Elaboration

In an answerer's remark on an unrelated question, it is mentioned that in regular expressions \d matches not only the ASCII digits 0 through 9 but also, for example, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷).

The above-mentioned question is tagged for another language, but the behavior can be observed in PHP as well. With this in mind, I wrote the following "test":

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);

The resulting array $capture contains a match on 5 only.

Using the u modifier to turn on "UTF-8 mode" and running this:

$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);

results in $capture containing matches on both ۳ and 5.

Notes

Upvotes: 3

Views: 2894

Answers (1)

Lucas Trzesniewski

Reputation: 51330

Because the documentation is broken. And it's not the only place where it is so, unfortunately.

PHP uses PCRE under the hood to implement its preg_* functions. PCRE's documentation is thus authoritative there. PHP's documentation is based on PCRE's, but it looks like you found yet another mistake.

Here's what you can read in PCRE's docs (emphasis mine):

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

[:alnum:]  becomes  \p{Xan}
[:alpha:]  becomes  \p{L}
[:blank:]  becomes  \h
[:digit:]  becomes  \p{Nd}
[:lower:]  becomes  \p{Ll}
[:space:]  becomes  \p{Xps}
[:upper:]  becomes  \p{Lu}
[:word:]   becomes  \p{Xwd}
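
As a rough check of the mapping above from the PHP side, the following sketch compares [[:digit:]] with \p{Nd} under the u modifier. It assumes PHP 7 or later (for the "\u{...}" string escape) and a PCRE build with Unicode property support; the variable names are illustrative only.

```php
<?php
// U+06F3 is EXTENDED ARABIC-INDIC DIGIT THREE (the Persian digit from the question).
$string = "I have \u{06F3} apples and 5 oranges";

// POSIX class inside a bracket expression, with the u modifier switched on.
preg_match_all('/[[:digit:]]+/u', $string, $posix);

// Explicit Unicode property "decimal number", also with the u modifier.
preg_match_all('/\p{Nd}+/u', $string, $unicode);

// If the u modifier enables PCRE_UCP, both patterns should find the same digits.
var_dump($posix[0] === $unicode[0]);
print_r($posix[0]);
```

On a build where the u modifier implies PCRE_UCP, both calls should capture the Persian digit as well as the 5, which is what the question observed.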

If you dig further in PHP's docs, you'll find the following:

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

This is, unfortunately, a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag is the one that changes the meaning of \d, \w and the like, as you can see from the docs above. Your tests confirm that.
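
If only ASCII digits are wanted while keeping the u modifier, one option is to spell out the range instead of relying on \d. This is a minimal sketch, assuming PHP 7 or later; the helper name digitRuns is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: returns every run of characters matching $class,
// always compiling the pattern with the u modifier.
function digitRuns(string $class, string $subject): array
{
    preg_match_all('/' . $class . '+/u', $subject, $capture);
    return $capture[0];
}

$string = "I have \u{06F3} apples and 5 oranges";

$any   = digitRuns('\d', $string);    // Unicode-aware under the u modifier
$ascii = digitRuns('[0-9]', $string); // explicit ASCII range only

print_r($any);
print_r($ascii);
```

The explicit range keeps UTF-8 handling of the subject string while restricting digit matches to 0 through 9, so the Persian digit is captured only by the \d variant.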


As a side note, don't infer properties of one regex flavor from another. It doesn't always work (heh, even this chart forgot about the PCRE_UCP option).

Upvotes: 3
