How to remove repeating white-space characters from UTF8 string in PHP properly with regex?

Question

I'm trying to remove repeating white-space characters from a UTF-8 string in PHP using regex. This regex

    $txt = preg_replace( '/\s+/i' , ' ', $txt );

usually works fine, but some of the strings contain the Cyrillic letter "Р", which gets corrupted by the replacement. After a little research I realized that the letter is encoded in UTF-8 as the two bytes \xD0 \xA0, and since \xA0 is a non-breaking space in Latin-1 (it is not part of ASCII), the regex replaces that byte with \x20 and the character is no longer valid.

Any ideas how to do this properly in PHP with regex?

asciimoo · Accepted Answer

It is described at http://www.php.net/manual/en/function.preg-replace.php#106981

If you want to match characters from any script, whether European, Russian, Chinese, Japanese, Korean, or whatever, just:

use mb_internal_encoding('UTF-8');
use preg_replace('...u', '...', $string) with the u (unicode) modifier

For further information, the complete list of preg_* modifiers can be found at: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
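Applied to the regex from the question, a minimal sketch could look like the following. The function name collapse_whitespace is hypothetical and only there for illustration, and the source file is assumed to be saved as UTF-8. As far as I know the preg_* functions do not consult mb_internal_encoding(), so that call is left out here; the u modifier is what makes the difference.

```php
<?php
// Hypothetical helper (name is an assumption, not from the linked manual note).
function collapse_whitespace(string $txt): string
{
    // The u modifier tells PCRE to treat pattern and subject as UTF-8,
    // so \s is tested against whole code points instead of single bytes.
    // The 0xA0 continuation byte inside "Р" (0xD0 0xA0) is therefore left alone.
    $result = preg_replace('/\s+/u', ' ', $txt);

    // With the u modifier, preg_replace returns null if the subject
    // is not valid UTF-8 (or on any other PCRE error).
    if ($result === null) {
        throw new InvalidArgumentException(
            'preg_replace failed with error code ' . preg_last_error()
        );
    }

    return $result;
}

// Usage: runs of spaces, tabs and newlines collapse to a single space.
$txt = "Привет,  \t Россия\n\nРим";
echo collapse_whitespace($txt), PHP_EOL; // prints "Привет, Россия Рим"
```

Checking for null is worth keeping: with the u modifier an invalid UTF-8 input does not come back unchanged, it makes the whole call fail.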
