anandr

Reputation: 1672

How to remove repeating white-space characters from UTF8 string in PHP properly with regex?

I'm trying to remove repeating white-space characters from a UTF-8 string in PHP using a regex. This regex

    $txt = preg_replace( '/\s+/i' , ' ', $txt );

usually works fine, but some of the strings contain the Cyrillic letter "Р", which gets corrupted by the replacement. After a little research I realized that the letter is encoded in UTF-8 as the two bytes 0xD0 0xA0, and since 0xA0 is the non-breaking space in Latin-1 (ISO-8859-1, not ASCII), the regex replaces that byte with 0x20 and the character is no longer valid.
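To make the byte-level problem concrete, here is a minimal sketch. It does not depend on how PCRE is configured: it simulates the faulty replacement with `str_replace()` and then checks validity with the `preg_match('//u', ...)` idiom. The variable names are illustrative only, and the `"\u{...}"` escape assumes PHP 7.0 or later.

```php
<?php
// "\u{0420}" is CYRILLIC CAPITAL LETTER ER ("Р"); the escape needs PHP 7.0+.
$letter = "\u{0420}";

// In UTF-8 the letter is stored as two bytes: 0xD0 0xA0.
$hex = bin2hex($letter);

// Simulate a byte-oriented replacement that treats 0xA0 as white space.
// The lead byte 0xD0 is left behind, followed by a plain space (0x20).
$broken = str_replace("\xA0", "\x20", $letter);

// preg_match() with the u modifier returns false (not 0 or 1)
// when the subject is not valid UTF-8.
$letterIsValid = preg_match('//u', $letter) === 1;
$brokenIsValid = preg_match('//u', $broken) === 1;

var_dump($hex, bin2hex($broken), $letterIsValid, $brokenIsValid);
```

The second byte of a two-byte UTF-8 sequence must be a continuation byte in the range 0x80 to 0xBF, so once 0xA0 becomes 0x20 the sequence can no longer be decoded.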

Any ideas how to do this properly in PHP with regex?

Upvotes: 5

Views: 4781

Answers (2)

asciimoo

Reputation: 661

This is described at http://www.php.net/manual/en/function.preg-replace.php#106981

If you want to match characters from any script (European, Russian, Chinese, Japanese, Korean or whatever), just:

  • use mb_internal_encoding('UTF-8');
  • use preg_replace('...u', '...', $string) with the u (unicode) modifier

For further information, the complete list of preg_* modifiers can be found at: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
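As a rough sketch of the advice above, the replacement could be wrapped like this. `collapse_whitespace()` is a hypothetical helper name, not a PHP built-in, and the exception type is an arbitrary choice. The `mb_internal_encoding()` call is left out because, as far as I know, the preg_* functions do not consult the mbstring setting; the `u` modifier is what matters here.

```php
<?php
/**
 * Collapse every run of white space into a single ASCII space.
 * Hypothetical helper for illustration; assumes $txt is meant to be UTF-8.
 */
function collapse_whitespace(string $txt): string
{
    // The u modifier makes PCRE treat the pattern and the subject as UTF-8,
    // so \s is matched against whole characters rather than single bytes.
    $result = preg_replace('/\s+/u', ' ', $txt);

    // preg_replace() returns null on failure, for example when the
    // subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace() failed with error code ' . preg_last_error()
        );
    }

    return $result;
}

// Usage: the Cyrillic letters survive, the white-space runs are collapsed.
$clean = collapse_whitespace("\u{0420}  \u{0420}\t\n\u{0420}");
```

Checking for `null` is worth the extra lines: with the `u` modifier, malformed input makes the call fail outright instead of producing a silently damaged string.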

Upvotes: 4

Passerby

Reputation: 10080

Try the u modifier:

    $txt = "UTF 字符串 with 空格符號";
    var_dump(preg_replace("/\\s+/iu", "", $txt));

Outputs:

    string(28) "UTF字符串with空格符號"

Upvotes: 5
