Reputation: 93
I have a string containing special characters, like:
Hello 🍀.
As far as I understand, "🍀" is a UTF-16 character.
How can I remove this "🍀" character and any other non-UTF-8 characters from a string?
The problem is that .NET and JavaScript see it as two valid UTF-8 characters:
int cs_len = "🍀".Length; // == 2 - C#
var js_len = "🍀".length // == 2 - JavaScript
where
strIn[0] is 55356 (0xD83C), which on its own is shown as an unprintable character
and
strIn[1] is 57152 (0xDF40), which on its own is also shown as an unprintable character
The following code snippets also return the string unchanged:
string strIn = "Hello 🍀";
string res;
byte[] bytes = Encoding.UTF8.GetBytes(strIn);
res = Encoding.UTF8.GetString(bytes);
return res; // Hello 🍀
and
string res = null;
using (var stream = new MemoryStream())
{
    var sw = new StreamWriter(stream, Encoding.UTF8);
    sw.Write(strIn);
    sw.Flush();
    stream.Position = 0;
    using (var sr = new StreamReader(stream, Encoding.UTF8))
    {
        res = sr.ReadToEnd();
    }
}
return res; // Hello 🍀
I need to support not only English but also Chinese, Japanese, and any other language, as well as any other UTF-8 characters. How can I remove or replace any UTF-16 characters, including the 🍀 sign, in C# or JavaScript code?
Thanks.
Upvotes: 1
Views: 3077
Reputation: 1
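// Removes every character except ASCII letters, digits and the listed punctuation and control characters.
// Note: this also strips all non-ASCII text (accented letters, Chinese, Japanese, ...).
string teste = @"F:\Thiago\Programação\Projetos\OnlineAppfdsdf^~²$\XML\nfexml";
string strConteudo = Regex.Replace(teste, "[^0-9a-zA-Z\\.\\,\\/\\x20\\/\\x1F\\-\\r\\n]+", string.Empty);
Console.WriteLine($"Teste: {teste}" +
$"\nTeste2: {strConteudo}");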
Upvotes: 0
Reputation: 93
I found a solution to my question. It does not cover all the UTF-16 characters, but it removes many of them:
title = title.replace(/([\uE000-\uF8FF]|\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF])/g, '*');
Here, I replace all special characters with a star (*). You can also pass an empty string ('') to remove them.
The /g flag at the end of the regular expression makes the replacement apply to all occurrences of these special characters; without it, string.replace(...) replaces only the first one.
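A broader variant of the same idea, as a minimal sketch: with the `u` flag a JavaScript regular expression matches whole code points, so a single range can cover every character that UTF-16 stores as a surrogate pair, while Chinese and Japanese text in the Basic Multilingual Plane is left alone. The helper name `stripAstral` is hypothetical, and also dropping private-use characters and unpaired surrogates is an assumption about what "special" should mean here. Note that rare CJK ideographs above U+FFFF (Extension B and later) would be removed as well.

```javascript
// Hypothetical helper: replaces code points above U+FFFF (stored as surrogate
// pairs in UTF-16), private-use characters, and unpaired surrogates.
function stripAstral(text, replacement = '') {
  // The `u` flag makes the regex work on code points instead of 16-bit code units.
  return text.replace(/[\u{10000}-\u{10FFFF}\uE000-\uF8FF\uD800-\uDFFF]/gu, replacement);
}

console.log(stripAstral('Hello \u{1F340}.'));               // Hello .
console.log(stripAstral('你好 \u{1F340} こんにちは', '*')); // 你好 * こんにちは
```

The second argument plays the same role as the '*' above: leave it out to delete the characters, or pass a placeholder to mark where they were.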
Upvotes: 1
Reputation: 11921
UTF-16 and UTF-8 "contain" the same number of "characters" (to be precise: code points that may represent a character, thanks to David Haim). The only difference is how they are encoded to bytes.
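To make the distinction concrete, here is a small sketch in JavaScript. It uses the character from the question (U+1F340, as implied by the values 55356 and 57152), written as an escape so the snippet does not depend on how the file is saved. The variable name `clover` is only illustrative.

```javascript
const clover = '\u{1F340}'; // one code point, outside the Basic Multilingual Plane

console.log(clover.length);                               // 2 (UTF-16 code units)
console.log([...clover].length);                          // 1 (code points)
console.log(clover.codePointAt(0).toString(16));          // 1f340
console.log(clover.charCodeAt(0), clover.charCodeAt(1));  // 55356 57152
```

So a length of 2 does not mean two characters: `length` counts 16-bit code units, and this one code point needs a surrogate pair.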
In your example "🍀" is 3C D8 40 DF in UTF-16 (little-endian) and F0 9F 8D 80 in UTF-8.
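These byte sequences can be reproduced with a short sketch, assuming a runtime that provides the standard `TextEncoder` (current browsers and Node.js do). The names `sample`, `utf8Bytes`, `utf16leBytes` and `toHex` are made up for the example.

```javascript
const sample = '\u{1F340}';

// UTF-8 bytes via the standard TextEncoder API.
const utf8Bytes = Array.from(new TextEncoder().encode(sample));

// UTF-16LE bytes: each 16-bit code unit, low byte first.
const utf16leBytes = [];
for (let i = 0; i < sample.length; i++) {
  const unit = sample.charCodeAt(i);
  utf16leBytes.push(unit & 0xff, unit >> 8);
}

const toHex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, '0')).join(' ');
console.log(toHex(utf8Bytes));    // f0 9f 8d 80
console.log(toHex(utf16leBytes)); // 3c d8 40 df
```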
From your problem description and your pasted string, I suspect that your source code is encoded in UTF-8 but your compiler/interpreter is reading it as UTF-16. In that case it will interpret the one-character UTF-8 sequence F0 9F 8D 80 as two separate UTF-16 characters, F0 9F and 8D 80: the first (U+F09F) is a private-use code point with no assigned character and the second (U+8D80) is a Han character.
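The misreading described above can be simulated as a rough sketch. It assumes the bytes are taken as big-endian 16-bit units, which is what the byte order in the paragraph implies; the variable names are illustrative.

```javascript
// The UTF-8 encoding of U+1F340.
const rawBytes = new Uint8Array([0xf0, 0x9f, 0x8d, 0x80]);

// Misread the same four bytes as two big-endian 16-bit code units.
const view = new DataView(rawBytes.buffer);
const misreadUnits = [view.getUint16(0, false), view.getUint16(2, false)];

console.log(misreadUnits.map((u) => u.toString(16)));     // [ 'f09f', '8d80' ]
console.log(String.fromCharCode(...misreadUnits).length); // 2
```

Comparing these values with the code units your runtime actually reports is a quick way to check whether this kind of misreading took place.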
As for how to solve the issue:
In your example, check which encoding your editor uses when saving source files, and whether you can pass that encoding to the compiler as an option.
You should also be aware that things will look quite different once you stop using hard-coded string literals and read your input from a file or over the network: you will have to handle encoding issues as soon as you read the input.
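As a hedged illustration of handling the encoding at read time, the sketch below decodes raw bytes with an explicitly named encoding, using the standard `TextDecoder`. The byte array stands in for whatever a file or socket would deliver; it is made up for the example.

```javascript
// Bytes as they might arrive from a file or socket: "Hello " + U+1F340 in UTF-8.
const incoming = new Uint8Array([0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0xf0, 0x9f, 0x8d, 0x80]);

// `fatal: true` makes decode() throw on malformed input instead of
// silently inserting U+FFFD replacement characters.
const decoder = new TextDecoder('utf-8', { fatal: true });
const decoded = decoder.decode(incoming);

console.log(decoded === 'Hello \u{1F340}'); // true
```

Naming the encoding explicitly, and failing loudly when the bytes do not match it, tends to surface a mismatch at the point of input rather than later as odd characters.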
Upvotes: 1