Reputation: 444
I'd like to remove 4-byte UTF-8 characters that start with \xF0 (the char with the ASCII code 0xF0) from a string, and tried:
sText = Regex.Replace (sText, "\xF0...", "");
This doesn't work. Using two backslashes did not work either.
The exact input is the content of https://de.wikipedia.org/w/index.php?title=Spezial:Exportieren&action=submit&pages=Unicode
The 4-byte character is the one after the text "[[Violinschlüssel]] ", in hex notation: .. 0x65 0x6c 0x5d 0x5d 0x20 0xf0 0x9d 0x84 0x9e 0x20 .. The expected output is .. 0x65 0x6c 0x5d 0x5d 0x20 0x20 ..
What's wrong?
Upvotes: 4
Views: 3235
Reputation: 61952
Such characters will be surrogate pairs in .NET, which uses UTF-16. Each of them will be two UTF-16 code units, that is, two char values.
To just remove them, you can do the following (requires using System.Linq;):
sText = string.Concat(sText.Where(x => !char.IsSurrogate(x)));
(This uses an overload of Concat introduced in .NET 4.0 (Visual Studio 2010).)
Late addition: It may give better performance to use:
sText = new string(sText.Where(x => !char.IsSurrogate(x)).ToArray());
even if it looks worse. (Works in .NET 3.5 (Visual Studio 2008).)
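As a rough, self-contained sketch of the second variant applied to the input from the question: the class name SurrogateDemo and the helper RemoveSurrogates are hypothetical names chosen for illustration. Note that char.IsSurrogate matches every surrogate code unit, so this removes all characters above U+FFFF (and any lone surrogates), not only those whose UTF-8 form starts with 0xF0.

```csharp
using System;
using System.Linq;

static class SurrogateDemo
{
    // Hypothetical helper: drops every UTF-16 surrogate code unit from the text.
    public static string RemoveSurrogates(string text)
    {
        return new string(text.Where(c => !char.IsSurrogate(c)).ToArray());
    }

    static void Main()
    {
        // "\U0001D11E" is U+1D11E, the character stored as F0 9D 84 9E in UTF-8.
        string input = "el]] \U0001D11E ";
        string output = RemoveSurrogates(input);

        // The character occupies two char values, so the length drops by 2.
        Console.WriteLine("{0} -> {1}", input.Length, output.Length); // 8 -> 6
    }
}
```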
Upvotes: 5
Reputation: 14038
You are trying to search for byte values but C# strings are made from char values. The C# language spec at section "2.4.4.4 Character literals" states:
A character literal represents a single character, and usually consists of a character in quotes, as in 'a'.
...
A hexadecimal escape sequence represents a single Unicode character, with the value formed by the hexadecimal number following \x.
Hence the search for "\xF0..." is searching for the character U+00F0, which would be represented in UTF-8 by the bytes C3 B0.
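This can be checked with a few lines of code. The sketch below is only an illustration; EscapeDemo and Utf8Hex are hypothetical names, and BitConverter.ToString separates the bytes with hyphens.

```csharp
using System;
using System.Text;

static class EscapeDemo
{
    // Hypothetical helper: UTF-8 bytes of a string as hyphen-separated hex.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    static void Main()
    {
        string s = "\xF0"; // one char with the value 0xF0, not a byte

        Console.WriteLine("U+{0:X4}", (int)s[0]); // U+00F0
        Console.WriteLine(Utf8Hex(s));            // C3-B0
    }
}
```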
If you want to find and replace all Unicode characters whose first UTF-8 byte is 0xF0, then I believe you need to search for the character values whose encoding starts with 0xF0.
The character U+10000 is represented as F0 90 80 80 (the preceding code is U+FFFF, which is EF BF BF). The first code with F1 .. .. .. is U+40000, which is F1 80 80 80, and the value before it is U+3FFFF, which is F0 BF BF BF.
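These boundary values can be reproduced by encoding the four code points. This is a small sketch; BoundaryDemo and Utf8Hex are hypothetical names, and the bytes are printed with hyphens because that is how BitConverter.ToString formats them.

```csharp
using System;
using System.Text;

static class BoundaryDemo
{
    // Hypothetical helper: UTF-8 bytes of a single code point as hex.
    public static string Utf8Hex(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        Console.WriteLine(Utf8Hex(0xFFFF));  // EF-BF-BF
        Console.WriteLine(Utf8Hex(0x10000)); // F0-90-80-80
        Console.WriteLine(Utf8Hex(0x3FFFF)); // F0-BF-BF-BF
        Console.WriteLine(Utf8Hex(0x40000)); // F1-80-80-80
    }
}
```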
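Hence you need to remove characters in the range U+10000 to U+3FFFF. In a .NET regular expression \x takes exactly two hex digits, and the engine matches UTF-16 char values, so the range cannot be written as \x10000-\x3FFFF. It has to be expressed as surrogate pairs instead: a high surrogate from D800 to D8BF followed by a low surrogate from DC00 to DFFF. This should be possible with a regular expression such as
sText = Regex.Replace(sText, @"[\uD800-\uD8BF][\uDC00-\uDFFF]", "");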
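A minimal check of a surrogate-pair pattern for U+10000 to U+3FFFF, assuming the .NET regex engine compares UTF-16 char values one at a time. LeadByteF0Demo and RemoveLeadByteF0 are hypothetical names. The upper bound D8BF comes from 0xD800 + ((0x3FFFF - 0x10000) >> 10).

```csharp
using System;
using System.Text.RegularExpressions;

static class LeadByteF0Demo
{
    // High surrogate D800..D8BF followed by any low surrogate covers U+10000..U+3FFFF.
    const string Pattern = @"[\uD800-\uD8BF][\uDC00-\uDFFF]";

    // Hypothetical helper: removes characters whose UTF-8 encoding starts with 0xF0.
    public static string RemoveLeadByteF0(string text)
    {
        return Regex.Replace(text, Pattern, "");
    }

    static void Main()
    {
        string clef = "\U0001D11E";                   // F0 9D 84 9E in UTF-8
        string other = char.ConvertFromUtf32(0x40000); // F1 80 80 80 in UTF-8

        Console.WriteLine(RemoveLeadByteF0("el]] " + clef + " ").Length); // 6
        Console.WriteLine(RemoveLeadByteF0(other) == other);               // True
    }
}
```

Unlike filtering on char.IsSurrogate, this keeps characters from U+40000 upward, whose UTF-8 encoding starts with F1 or higher.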
The relevant characters from the source quoted in the question have been extracted into the code below. The code then prints each char value to show how the characters are held in a string.
using System;

class Program
{
    static void Main(string[] args)
    {
        string input = "] 𝄞 (";
        Console.Write("Input length {0} : '{1}' : ", input.Length, input);
        // Print every UTF-16 code unit (char) of the string in hex.
        foreach (char cc in input)
        {
            Console.Write(" {0,2:X02}", (int)cc);
        }
        Console.WriteLine();
    }
}
The output from the program is as below. This supports the surrogate pair explanation given by @Jeppe in his answer.
Input length 6 : '] ?? (' : 5D 20 D834 DD1E 20 28
Upvotes: 2