Some intelligent guy

Reputation: 266

PHP 5.3 preg_match with umlauts and the UTF-8 modifier

The following command returns true on a PHP 5.3.8 LAMP server (Ubuntu 11.04), but false on a PHP 5.3.2 LAMP server (Ubuntu 10.04.2 LTS).

<?php echo preg_match('/\w/u', 'ß'); ?>

I changed nearly every setting in the php.ini file, without success. I also changed the system locale to en_US.UTF-8 and made it the default locale for PHP. Additionally, I tried the de_DE.UTF-8 locale.

In both cases I am using the default packages provided by Ubuntu.

Does anybody have another idea what to change, without compiling any packages, so that PHP 5.3.2 also returns true?

Upvotes: 2

Views: 2403

Answers (2)

Christos Pontikis

Reputation: 306

Unicode is not yet fully supported in PHP.

The following code

$url='abc αβγ';
define('CONST_REGEX_SANITIZE_URL', '/[^\040\w\/\.\-\:]/u');
$invalid_url = preg_match(CONST_REGEX_SANITIZE_URL, $url) ? 'true' : 'false';
echo $invalid_url;

prints 'false' with PHP > 5.3.10 and 'true' with PHP < 5.3.3 (which, incidentally, is the current Debian PHP version).

Upvotes: 0

Gumbo

Reputation: 655309

PHP 5.3.2 uses PCRE 8.00 while PHP 5.3.8 uses PCRE 8.11. One change in PCRE 8.10 was the addition of the PCRE_UCP option:

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

Unfortunately, you can’t trigger this option directly with a pattern modifier in PHP. The u modifier sets it together with PCRE_UTF8 when it is available (PHP 5.3.4 and later).
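If upgrading is not an option, one possible workaround is to avoid \w and spell out the character class with Unicode properties, which do not depend on PCRE_UCP. The following is only a sketch: containsWordChar is a hypothetical helper name, and it assumes the script is saved as UTF-8 and that the installed PCRE was built with Unicode property support (the same requirement the quoted documentation mentions).

```php
<?php
// Hypothetical helper (not part of PHP or of the original answers).
// Approximates \w under PCRE_UCP, which PCRE documents as:
// any character matched by \p{L} or \p{N}, plus underscore.
function containsWordChar($subject)
{
    // The u modifier only enables UTF-8 handling here; the \p{..}
    // properties do the Unicode classification, so PCRE_UCP is not needed.
    return preg_match('/[\p{L}\p{N}_]/u', $subject) === 1;
}

// PCRE_VERSION is a built-in constant reporting the PCRE library PHP uses;
// handy for confirming which of the two servers' versions is in play.
echo 'PCRE version: ', PCRE_VERSION, "\n";

var_dump(containsWordChar('ß'));   // a letter
var_dump(containsWordChar(' - ')); // only space and punctuation
```

With Unicode property support compiled in, the first call should report a match and the second should not, regardless of whether the u modifier also sets PCRE_UCP.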

Upvotes: 6
