Reputation: 43
I'm trying to get this regex to work. It is intended to find two words near each other in a sentence.
echo (int)preg_match('/\bHello\W+(?:\w+\W+){0,6}?World\b/ui', 'Hello, world!', $matches).PHP_EOL;
print_r($matches);
And it works perfectly:
1
Array
(
[0] => Hello, world
)
... but only with Latin words. When I switch to Unicode, it doesn't find anything. There is no need to check the syntax, because the pattern comes from a book (chapter 8, "Find Two Words Near Each Other"). It simply fails on Unicode strings like this one: 'Привіт, світу!' (Ukrainian).
I have checked almost every possible cause:
✓ I'm using the 'u' flag in the regex pattern.
✓ I'm enabling UTF-8 support in the code before this statement runs, like this:
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
✓ My PCRE on Debian Linux is compiled correctly:
# pcretest -C
PCRE version 8.02 2010-03-19
Compiled with
UTF-8 support
Unicode properties support
Newline sequence is LF
\R matches all Unicode newlines
Internal link size = 2
POSIX malloc threshold = 10
Default match limit = 10000000
Default recursion depth limit = 10000000
Match recursion uses stack
✓ I even tried adding the odd-looking (*UTF8) sequence to the pattern, as suggested in this answer here, but it didn't help:
echo (int)preg_match('/(*UTF8)\bПривіт\W+(?:\w+\W+){0,6}?світу\b/ui', 'Привіт, світу!', $matches).PHP_EOL;
print_r($matches);
The result:
0
Array
(
)
So my question is: why doesn't Unicode work here when it works perfectly for other Unicode patterns I'm using in the same code? Those are a bit simpler, though, like this one:
echo (int)preg_match('/Привіт/ui', 'Привіт, світу!', $matches).PHP_EOL;
print_r($matches);
This surprisingly works:
1
Array
(
[0] => Привіт
)
And finally, funnily enough, it works fine on this online regex tester. That's why I'm so frustrated: I tested it there and expected it to work in my code too, but it doesn't.
Oh wise Stack Overflow, please give me a hint.
Upvotes: 4
Views: 843
Reputation: 560
I had a similar problem once and discovered that UTF-8 symbols inside patterns do not work on some versions of PHP. Even PHP 5.3, which was current at the time, had this problem. Check out your example here: http://3v4l.org/7HurJ. According to that test, you need at least PHP 5.3.4 for that pattern to work, but I don't think the version number alone means much here. It may actually depend on some compile option, or there may be a workaround, but I didn't dig deeper and simply adjusted my approach to avoid any "funny" symbols in expressions.
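If you want to try a workaround anyway, here is a minimal sketch. It rests on an assumption that this thread does not confirm: that on the affected builds \w, \W and \b only recognise ASCII word characters even with the u flag, while Unicode properties such as \p{L} still work (your pcretest output lists "Unicode properties support"). The variable names $word and $nonWord are mine, chosen for illustration only.

```php
<?php
// Diagnostic: prints 1 if \w treats a Cyrillic letter as a word character, 0 otherwise.
echo (int)preg_match('/^\w$/u', 'П') . PHP_EOL;

// Hypothetical workaround: spell out "word character" with Unicode properties
// instead of relying on \w, \W and \b.
$word    = '[\p{L}\p{N}_]';   // a letter, a digit, or an underscore
$nonWord = '[^\p{L}\p{N}_]';  // anything else

// The lookbehind and lookahead stand in for the two \b assertions.
$pattern = '/(?<!' . $word . ')Привіт' . $nonWord . '+'
         . '(?:' . $word . '+' . $nonWord . '+){0,6}?'
         . 'світу(?!' . $word . ')/ui';

$found = preg_match($pattern, 'Привіт, світу!', $matches);
echo (int)$found . PHP_EOL;
print_r($matches);
```

If the first line prints 0 on your machine while the second pattern matches, that would suggest the shorthand classes are the culprit rather than the Cyrillic literals themselves. I have not verified this on PCRE 8.02, so treat it as something to test rather than a confirmed fix.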
Upvotes: 1