Reputation: 3067
Although https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php does not mention it at all, PCRE does not seem to work correctly with UTF-8 strings prior to PHP 5.3.4, even with the 'u' modifier (which is supposed to enable UTF-8 support and which, according to that documentation, has been available since PHP 4.something).
preg_split("/\W+/u", $someUtf8String)
will work as expected on PHP 5.3.4 and above, but on older versions it breaks the string at characters such as ó ò ú í ì and the like, as if they were non-word characters.
See http://3v4l.org/ERDp5. If you doubt (as I do) whether the string is actually UTF-8 encoded, you can try http://3v4l.org/6XnOj and http://3v4l.org/mak33.
Either there was a bug that was fixed only in 5.3.4, or UTF-8 was not supported before then (in which case I wonder why the 'u' modifier was available at all).
The question is: is there a workaround for older PHP versions? I need \W to work correctly on a UTF-8 string on PHP 5.1.6.
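One way to rule out an encoding problem locally is to build the string from explicit bytes and validate it before splitting. This is only a sketch: the sample text is made up, and mb_check_encoding() assumes the mbstring extension is loaded.

```php
<?php
// Build the test string from explicit bytes so the encoding of this
// source file cannot matter: "\xC3\xB3" is the UTF-8 encoding of ó.
$someUtf8String = "una fecha hist\xC3\xB3rica";

// Assumption: mbstring is available. Reports whether the bytes are valid UTF-8.
$isUtf8 = mb_check_encoding($someUtf8String, 'UTF-8');
var_dump($isUtf8);

// The call from the question: split on runs of non-word characters.
$parts = preg_split('/\W+/u', $someUtf8String);
print_r($parts);
```

If $isUtf8 is true and the last element still comes back broken at the ó, the input is fine and the PCRE build is the culprit.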
Upvotes: 1
Views: 671
Reputation: 48721
How about mb_split?
mb_split("\W+", "histórica");
Note: the pattern is passed without delimiters.
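A slightly fuller sketch of the same idea follows. mb_split() has no 'u' modifier; it uses the encoding set with mb_regex_encoding(), so setting that explicitly to UTF-8 is probably a good idea on old installations where the default may differ. The sample sentence and variable names are made up for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Assumption: mbstring is loaded. mb_split() takes its encoding from
// mb_regex_encoding(), not from a pattern modifier.
mb_regex_encoding('UTF-8');

// Hypothetical sample input, written with byte escapes so it is UTF-8
// regardless of how this file is saved ("\xC3\xB3" is ó).
$text = "una fecha hist\xC3\xB3rica, sin duda";

// The pattern is given without delimiters, as mb_split() expects.
$words = mb_split('\W+', $text);

print_r($words);
```

Whether this behaves identically on PHP 5.1.6 depends on the mbstring/Oniguruma build shipped with it, so it is worth checking on the target machine.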
Upvotes: 3