Reputation: 34632
Despite the PHP manual stating:
Why do Persian digits match \d or [[:digit:]] in "UTF-8 mode"?
In an answerer's remark on an unrelated question, it is mentioned that in regular expressions \d matches not only the ASCII digits 0 through 9 but also, for example, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷).
The above-mentioned question is tagged java, but the behavior can be observed in PHP as well. With this in mind, I wrote the following "test":
$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/', $string, $capture);
The resulting array $capture contains a match on 5 only.
Using the u modifier to turn on "UTF-8 mode" and running this:
$string = 'I have ۳ apples and 5 oranges';
preg_match_all('/\d+/u', $string, $capture);
results in $capture containing matches on both ۳ and 5.
C locale.
Upvotes: 3
Views: 2894
Reputation: 51330
Because the documentation is broken. And it's not the only place where it is so, unfortunately.
PHP uses PCRE under the hood to implement its preg_* functions. PCRE's documentation is thus authoritative there. PHP's documentation is based on PCRE's, but it looks like you found yet another mistake.
Here's what you can read in PCRE's docs (emphasis mine):
By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
[:blank:] becomes \h
[:digit:] becomes \p{Nd}
[:lower:] becomes \p{Ll}
[:space:] becomes \p{Xps}
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
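One way to check a few of these substitutions from PHP is to compare each POSIX class with its Unicode property counterpart on the same subject. This is only a sketch: it assumes a UTF-8 encoded source file and a PHP build whose u modifier enables Unicode character properties, and the subject string and the $pairs and $results names are made up for illustration.

```php
<?php
// Assumption: the u modifier turns on PCRE_UCP in this PHP build.
// The subject mixes an ASCII letter, a Persian digit, an ASCII digit
// and a non-ASCII uppercase letter (illustrative data only).
$subject = 'x۳5É';

// POSIX class => Unicode property it is expected to be replaced by.
$pairs = [
    '[[:digit:]]' => '\p{Nd}',
    '[[:alpha:]]' => '\p{L}',
    '[[:upper:]]' => '\p{Lu}',
    '[[:lower:]]' => '\p{Ll}',
];

$results = [];
foreach ($pairs as $posix => $property) {
    preg_match_all('/' . $posix . '/u', $subject, $posixMatches);
    preg_match_all('/' . $property . '/u', $subject, $propertyMatches);
    // Store both match lists so they can be compared side by side.
    $results[$posix] = [$posixMatches[0], $propertyMatches[0]];
}

foreach ($results as $posix => [$a, $b]) {
    echo $posix, ': ', ($a === $b ? 'same matches' : 'different matches'), PHP_EOL;
}
```

If the two lists differ for some class, the build is probably not applying the substitutions described in the quote.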
If you dig further in PHP's docs, you'll find the following:
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.
This is, unfortunately, a lie. The u modifier in PHP means PCRE_UTF8 | PCRE_UCP (UCP stands for Unicode Character Properties). The PCRE_UCP flag is the one that changes the meaning of \d, \w and the like, as you can see from the docs above. Your tests confirm that.
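To see the effect side by side, here is a small sketch built on the string from the question. It assumes a UTF-8 encoded source file and a PHP build where u behaves as described above; the helper name findDigits is hypothetical. It also shows that an explicit [0-9] range is unaffected by the flag, which is one way to match only ASCII digits while keeping the u modifier.

```php
<?php
// Hypothetical helper: returns every full match of $pattern in $subject.
function findDigits(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches);
    return $matches[0];
}

$string = 'I have ۳ apples and 5 oranges';

// Without u: \d only covers ASCII digits, as in the question's first test.
$plain = findDigits('/\d+/', $string);

// With u (assumed to imply PCRE_UCP): \d and [[:digit:]] use Unicode properties.
$unicodeEscape = findDigits('/\d+/u', $string);
$unicodePosix  = findDigits('/[[:digit:]]+/u', $string);

// An explicit range lists code points literally, so it stays ASCII-only.
$asciiRange = findDigits('/[0-9]+/u', $string);

var_dump($plain, $unicodeEscape, $unicodePosix, $asciiRange);
```

On such a build, $plain and $asciiRange should each contain only 5, while the two Unicode-aware variants should contain both ۳ and 5.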
As a side note, don't infer properties of one regex flavor from another. It doesn't always work (heh, even this chart forgot about the PCRE_UCP option).
Upvotes: 3