user3740011

Reputation: 43

Valid regex doesn't match Unicode text in PHP 5.3.3-7

I'm trying to get this regex to work. It is intended to find two words near each other in a sentence.

echo (int)preg_match('/\bHello\W+(?:\w+\W+){0,6}?World\b/ui', 'Hello, world!', $matches).PHP_EOL;
print_r($matches);

And it works perfectly:

1
Array
(
    [0] => Hello, world
)

... but only with Latin words. When I switch to Unicode, it doesn't find anything. There is no need to scrutinize the syntax because it's from a book (chapter 8, "Find Two Words Near Each Other"). It simply fails for Unicode strings like this: 'Привіт, світу!' (Ukrainian).

I have checked almost every possible cause:

✓ I'm using the 'u' flag in the regex pattern.

✓ I'm enabling UTF-8 support in the code before executing this statement:

 ini_set('default_charset', 'UTF-8');
 mb_internal_encoding('UTF-8');
 mb_regex_encoding('UTF-8');

✓ My PCRE on Debian Linux is compiled correctly:

 # pcretest -C
 PCRE version 8.02 2010-03-19
 Compiled with
   UTF-8 support
   Unicode properties support
   Newline sequence is LF
   \R matches all Unicode newlines
   Internal link size = 2
   POSIX malloc threshold = 10
   Default match limit = 10000000
   Default recursion depth limit = 10000000
   Match recursion uses stack

✓ I even tried adding the (*UTF8) sequence to the start of the pattern, as suggested in another answer, but it didn't help:

echo (int)preg_match('/(*UTF8)\bПривіт\W+(?:\w+\W+){0,6}?світу\b/ui', 'Привіт, світу!', $matches).PHP_EOL;
print_r($matches);

The result:

0
Array
(
)

So my question is: why is Unicode not working here when it works perfectly for other Unicode patterns I'm using in the same code? Those are a bit simpler, though, like this:

echo (int)preg_match('/Привіт/ui', 'Привіт, світу!', $matches).PHP_EOL;
print_r($matches);

This surprisingly works:

1
Array
(
    [0] => Привіт
)

And finally, funnily enough, it works fine on an online regex tester. That's why I'm so frustrated: I tested it there and expected it to work in my code too, but it doesn't.

Oh wise Stack Overflow, please give me a hint.

Upvotes: 4

Views: 843

Answers (1)

Gas Welder

Reputation: 560

I had a similar problem once and discovered that UTF-8 characters inside patterns do not work on some versions of PHP. Even 5.3, which was current at the time, had this problem. Check out your example here: http://3v4l.org/7HurJ. According to that test, you need at least PHP 5.3.4 for that pattern to work, but I don't think the version number itself is what matters. It may depend on a compile option, or there may be a workaround, but I didn't dig deeper and simply changed my approach to avoid non-ASCII characters in expressions.
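One possible workaround, sketched here under assumptions this thread does not confirm: as far as I know, PCRE releases before 8.10 treat \w, \W and \b as ASCII-only even in UTF-8 mode, which would fit the PCRE 8.02 shown in the question and explain why the plain /Привіт/ui pattern still matches. If that is the cause, spelling out the word characters with Unicode property classes avoids those shorthands. The helper name nearPattern() and its parameters are hypothetical, made up for this illustration, and the character class is only an approximation of \w.

```php
<?php
// Hypothetical helper (not from the book or the thread): builds a
// "two words near each other" pattern without \b, \w or \W.
function nearPattern($first, $second, $maxGap = 6)
{
    $word    = '[\p{L}\p{N}_]';   // assumed stand-in for \w: letters, digits, underscore
    $nonWord = '[^\p{L}\p{N}_]';  // assumed stand-in for \W

    return '/(?<!' . $word . ')'            // replaces the leading \b
        . preg_quote($first, '/')
        . $nonWord . '+'
        . '(?:' . $word . '+' . $nonWord . '+){0,' . (int)$maxGap . '}?'
        . preg_quote($second, '/')
        . '(?!' . $word . ')/ui';           // replaces the trailing \b
}

// Diagnostics: the PCRE version PHP itself was built against (it may differ
// from what pcretest reports), and whether \w sees a Cyrillic word under /u.
echo 'PCRE: ', PCRE_VERSION, PHP_EOL;
echo '\w matches Cyrillic: ', preg_match('/^\w+$/u', 'Привіт'), PHP_EOL;

$found = preg_match(nearPattern('Привіт', 'світу'), 'Привіт, світу!', $matches);
echo (int)$found, PHP_EOL;
print_r($matches);
```

If the second diagnostic line prints 0, the \w shorthand is not Unicode-aware on that build, and the property-class version is worth trying. It relies on the "Unicode properties support" that your pcretest output lists, so it should not need a newer PCRE.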

Upvotes: 1
