Reputation: 1672
I'm trying to remove repeating white-space characters from a UTF-8 string in PHP using a regex. This regex
$txt = preg_replace( '/\s+/i' , ' ', $txt );
usually works fine, but some of the strings contain the Cyrillic letter "Р", which is corrupted after the replacement. After a little research I realized that the letter is encoded in UTF-8 as the two bytes \xD0 \xA0, and since \xA0 is the non-breaking space in Latin-1 (ISO-8859-1, not ASCII), the byte-oriented regex replaces it with \x20 and the character is no longer valid.
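The byte-level failure described here can be illustrated with a small sketch. It simulates the bad replacement by hand with str_replace, because whether a non-/u \s actually matches \xA0 depends on the PCRE build and locale; that simulation is an assumption, not a reproduction of the exact regex behaviour.

```php
<?php
// "Р" (U+0420) written out as its raw UTF-8 bytes.
$letter = "\xD0\xA0";
echo bin2hex($letter), "\n"; // d0a0

// Simulate a byte-oriented match that treats 0xA0 as whitespace
// (assumption: this is what the regex without /u did in the question).
$broken = str_replace("\xA0", " ", $letter);
echo bin2hex($broken), "\n"; // d020

// preg_match with the /u modifier returns false when the subject
// is not valid UTF-8, so it doubles as a validity check here.
var_dump(preg_match('//u', $letter) !== false); // bool(true)
var_dump(preg_match('//u', $broken) !== false); // bool(false)
```

The lone \xD0 left behind is a lead byte with no continuation byte, which is why the character shows up as garbage afterwards.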
Any ideas how to do this properly in PHP with regex?
Upvotes: 5
Views: 4781
Reputation: 661
This is described at http://www.php.net/manual/en/function.preg-replace.php#106981
If you want to match characters from any script (European, Russian, Chinese, Japanese, Korean or whatever), just:
...
u', '...', $string) with the u (Unicode) modifier.
For further information, the complete list of preg_* modifiers can be found at: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
Upvotes: 4
Reputation: 10080
Try the u modifier:
$txt="UTF 字符串 with 空格符號";
var_dump(preg_replace("/\\s+/iu","",$txt));
Outputs:
string(28) "UTF字符串with空格符號"
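Note that the snippet above replaces with an empty string, so it removes whitespace entirely. If the goal is to collapse each run into a single space, as the question asks, a minimal sketch could look like the following. The helper name collapse_whitespace and the sample text are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical helper, not part of the original answers.
function collapse_whitespace(string $text): string
{
    // With /u the pattern and subject are treated as UTF-8, so \s is
    // matched against whole characters instead of individual bytes.
    $result = preg_replace('/\s+/u', ' ', $text);

    // preg_replace returns null on error (for example, when the input
    // is not valid UTF-8); fall back to the unchanged input then.
    return $result ?? $text;
}

$txt = "Привет,  \t Рим!\n\nUTF 字符串";
echo collapse_whitespace($txt), "\n"; // Привет, Рим! UTF 字符串
```

The i modifier is dropped here because \s has no case to ignore. The fallback on null is one possible design choice; calling preg_last_error() and reporting the failure would be another.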
Upvotes: 5