Remove 4-byte UTF-8 characters

Question

I'd like to remove 4-byte UTF-8 characters, which start with \xF0 (the byte with value 0xF0), from a string and tried:

sText = Regex.Replace (sText, "\xF0...", "");

This doesn't work. Using two backslashes did not work either.

The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode where the 4-byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..

What's wrong?

Jeppe Stig Nielsen · Accepted Answer

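Such characters are surrogate pairs in .NET, which uses UTF-16: each one is stored as two UTF-16 code units, that is, two char values. A .NET string therefore contains no 0xF0 byte for the pattern to match; "\xF0" denotes the character U+00F0 (ð) instead.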
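To make the surrogate-pair point concrete, here is a minimal sketch that lists the UTF-16 code units of a string. The class and method names (SurrogateInfo, Describe) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;

static class SurrogateInfo
{
    // Hypothetical helper: lists each UTF-16 code unit of a string
    // and says whether it is a high surrogate, a low surrogate, or neither.
    public static string Describe(string s)
    {
        var parts = new List<string>();
        foreach (char c in s)
        {
            string kind = char.IsHighSurrogate(c) ? "high surrogate"
                        : char.IsLowSurrogate(c) ? "low surrogate"
                        : "ordinary";
            parts.Add($"U+{(int)c:X4} ({kind})");
        }
        return string.Join(", ", parts);
    }
}
```

For the character from the question (U+1D11E, UTF-8 bytes F0 9D 84 9E), SurrogateInfo.Describe("\U0001D11E") should report two code units, U+D834 and U+DD1E, and the string's Length is 2.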

To just remove them, you can do (using System.Linq;):

sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));

(This uses an overload of string.Concat introduced in .NET 4.0 (Visual Studio 2010).)

Late addition: It may give better performance to use:

sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());

even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
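Since the question started from Regex.Replace, the following sketch puts the LINQ approach above next to a regex alternative. The regex variant is an assumption on my part and not part of the original answer; the class and method names are hypothetical. The two differ in one respect: the LINQ filter drops every surrogate code unit, while the regex only removes well-formed pairs (a high surrogate followed by a low surrogate) and leaves a stray unpaired surrogate in place.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SupplementaryStripper
{
    // LINQ variant from the answer: drops every surrogate code unit,
    // paired or not.
    public static string RemoveWithLinq(string text) =>
        new string(text.Where(c => !char.IsSurrogate(c)).ToArray());

    // Regex variant (assumption, not from the answer): removes only
    // a high surrogate immediately followed by a low surrogate.
    public static string RemoveWithRegex(string text) =>
        Regex.Replace(text, @"[\uD800-\uDBFF][\uDC00-\uDFFF]", "");
}
```

With the input from the question, "el]] \U0001D11E ", both methods should return "el]]  " (two spaces after the brackets), which matches the expected bytes 0x65 0x6c 0x5d 0x5d 0x20 0x20.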
