gscriptor

Reputation: 93

How to remove UTF16 characters from string?

I have a string containing special characters, like:

Hello 🍀.

As far as I understand, "🍀" is a UTF-16 character.

How can I remove this "🍀" character, and any other non-UTF-8 characters, from a string?

The problem is that .NET and JavaScript see it as two valid UTF-8 characters:

int cs_len = "🍀".Length; // == 2 in C#
var js_len = "🍀".length; // == 2 in JavaScript

where

strIn[0] is 55356, UTF-8 character == ☐

and

strIn[1] is 57152, UTF-8 character == ☐

The following code snippets also return the same result:

string strIn = "Hello 🍀";
string res;
byte[] bytes = Encoding.UTF8.GetBytes(strIn);
res = Encoding.UTF8.GetString(bytes);
return res; // Hello 🍀

and

string res = null;

using (var stream = new MemoryStream())
{
    var sw = new StreamWriter(stream, Encoding.UTF8);

    sw.Write(strIn);
    sw.Flush();
    stream.Position = 0;

    using (var sr = new StreamReader(stream, Encoding.UTF8))
    {
        res = sr.ReadToEnd();
    }
}

return res; // Hello 🍀

I need to support not only English but also Chinese, Japanese, and any other language, as well as any other UTF-8 characters. How can I remove or replace any UTF-16 characters, including the 🍀 sign, in C# or JavaScript code?

Thanks.

Upvotes: 1

Views: 3077

Answers (3)

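// Requires: using System; using System.Text.RegularExpressions;
string teste = @"F:\Thiago\Programação\Projetos\OnlineAppfdsdf^~²$\XML\nfexml";
// Remove every run of characters that is not an ASCII letter, a digit,
// '.', ',', '/', space, U+001F, '-', CR or LF. Note: this also removes all non-ASCII letters.
string strConteudo = Regex.Replace(teste, "[^0-9a-zA-Z\\.\\,\\/\\x20\\/\\x1F\\-\\r\\n]+", string.Empty);

Console.WriteLine($"Teste: {teste}" +
    $"\nTeste2: {strConteudo}");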

Upvotes: 0

gscriptor

Reputation: 93

I found a solution to my question. It does not cover all UTF-16 characters, but it removes many of them:

// `title` holds the input string
title = title.replace(/([\uE000-\uF8FF]|\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDDFF])/g, '*');

Here I replace each special character with a star, '*'. You can also pass an empty string '' to remove them.

The g at the end of the regex is the global flag: it makes the replacement apply to every occurrence of these special characters. Without it, string.replace(...) replaces only the first match.
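A broader variant of the same idea, as a minimal sketch rather than a definitive fix: every character outside the Basic Multilingual Plane is stored as a surrogate pair, so a regex with the `u` flag can match that whole range at once instead of listing individual blocks. The helper name `stripAstral` and its default replacement are assumptions made for illustration. It keeps ordinary Chinese and Japanese text, but it would also remove the rarer CJK ideographs that live above U+FFFF.

```javascript
// Hypothetical helper: replaces every code point above U+FFFF
// (stored as a surrogate pair in UTF-16) and any unpaired surrogate.
function stripAstral(text, replacement = '') {
  // With the `u` flag the regex works on code points, so a valid surrogate
  // pair is matched as one unit by the \u{10000}-\u{10FFFF} range, and
  // \uD800-\uDFFF only matches surrogates that are NOT part of a valid pair.
  return text.replace(/[\uD800-\uDFFF\u{10000}-\u{10FFFF}]/gu, replacement);
}

console.log(stripAstral('Hello \u{1F340}.'));      // prints "Hello ."
console.log(stripAstral('Hello \u{1F340}.', '*')); // prints "Hello *."
```

Unlike the pattern above, this does not touch the private-use range U+E000 to U+F8FF; add it to the character class if those should go too.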

Upvotes: 1

piet.t

Reputation: 11921

UTF-16 and UTF-8 "contain" the same number of "characters" (to be precise: of code points that may represent a character, thanks to David Haim); the only difference is how those code points are encoded to bytes.
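To make the code point versus encoding distinction concrete, here is a small sketch in JavaScript. It assumes the character in the question is U+1F340, which is what the two values 55356 and 57152 reported there combine to; the variable names are only illustrative.

```javascript
// The character from the question, written as an escape so the
// snippet does not depend on how this page or file is encoded.
const clover = '\u{1F340}';

const codeUnits = clover.length;                 // counts UTF-16 code units
const codePoints = Array.from(clover).length;    // Array.from iterates by code point
const codePointHex = clover.codePointAt(0).toString(16);
const units = [clover.charCodeAt(0), clover.charCodeAt(1)];

console.log(codeUnits, codePoints); // prints: 2 1
console.log(codePointHex);          // prints: 1f340
console.log(units);                 // prints: [ 55356, 57152 ]
```

So `.length` reports two because it counts 16-bit code units, while the string still holds one code point.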

In your example, "🍀" is 3C D8 40 DF in UTF-16 (little-endian) and F0 9F 8D 80 in UTF-8.
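These byte sequences can be reproduced with a short sketch. It assumes the character is U+1F340 and uses the standard `TextEncoder` for UTF-8; the UTF-16 bytes are built by hand in little-endian order, and the helper name `hex` is hypothetical.

```javascript
const text = '\u{1F340}';

// Hypothetical helper: format one byte as two uppercase hex digits.
const hex = (b) => b.toString(16).toUpperCase().padStart(2, '0');

// UTF-8: TextEncoder always encodes to UTF-8.
const utf8Hex = Array.from(new TextEncoder().encode(text), hex).join(' ');

// UTF-16LE: each 16-bit code unit is written low byte first.
const utf16leBytes = [];
for (let i = 0; i < text.length; i++) {
  const unit = text.charCodeAt(i);
  utf16leBytes.push(unit & 0xff, unit >> 8);
}
const utf16leHex = utf16leBytes.map(hex).join(' ');

console.log(utf8Hex);    // prints: F0 9F 8D 80
console.log(utf16leHex); // prints: 3C D8 40 DF
```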

From your problem description and your pasted string, I suspect that your source code is encoded in UTF-8 but your compiler/interpreter is reading it as UTF-16. It would then interpret the single-character UTF-8 sequence F0 9F 8D 80 as two separate UTF-16 characters, F0 9F and 8D 80: the first is not an assigned character (U+F09F lies in the private-use area) and the second is a Han character.
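The misreading described above can be simulated as a rough sketch: take the four UTF-8 bytes and pair them up as 16-bit units. Big-endian byte order is an assumption here, chosen to match the pairing F0 9F / 8D 80; the variable names are illustrative.

```javascript
// The UTF-8 bytes of the character, as quoted above.
const utf8Bytes = [0xF0, 0x9F, 0x8D, 0x80];

// Pair the bytes up as big-endian 16-bit units (assumed byte order).
const misread = [];
for (let i = 0; i < utf8Bytes.length; i += 2) {
  misread.push((utf8Bytes[i] << 8) | utf8Bytes[i + 1]);
}

const misreadText = String.fromCharCode(...misread);

console.log(misread.map((u) => u.toString(16))); // prints: [ 'f09f', '8d80' ]
console.log(misreadText.length);                 // prints: 2
```

The resulting values can be compared with the two code units the program actually reports to check whether this is what happened.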

As for how to solve the issue:

In your example, check which encoding your editor uses when it saves your source files, and check whether you can pass that encoding to the compiler as an option.

Also be aware that things look quite different once you stop using hardcoded string literals and read your input from a file or over the network: you then have to handle the encoding at the point where you read the input.
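As a minimal sketch of handling the encoding at read time, the bytes below stand in for data that arrived from a file or socket (the sample content is made up for illustration). Decoding them with an explicitly named encoding via the standard `TextDecoder` avoids relying on any default.

```javascript
// Stand-in for received bytes: "Hi " followed by F0 9F 8D 80 (UTF-8).
const received = new Uint8Array([0x48, 0x69, 0x20, 0xF0, 0x9F, 0x8D, 0x80]);

// Name the encoding explicitly instead of trusting a platform default.
const decoded = new TextDecoder('utf-8').decode(received);

console.log(decoded.length);              // prints: 5 (UTF-16 code units)
console.log(Array.from(decoded).length);  // prints: 4 (code points)
console.log(decoded.codePointAt(3).toString(16)); // prints: 1f340
```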

Upvotes: 1
