Joe.wang
Joe.wang

Reputation: 11791

How to remove the unicode character in the string

Say we have a string like below.

string s = "此检查项己被你忽略,请联系医生。\u2028内科";

How can I remove the unicode character like \u2028 in the string ?

I had tried the below function. Unfortunately they all doesn't work. Please save me. Thanks.

Unicode characters string

Convert a Unicode string to an escaped ASCII string

Replace unicode escape sequences in a string

Updated

Why the below code doesn't work for me ?

enter image description here

Updated I tried to display the string in the output. It is a line separator.

enter image description here

Upvotes: 2

Views: 3983

Answers (1)

41686d6564
41686d6564

Reputation: 19641

As noted by @spender in the comments above:

The fundamental premise of your question (removing unicode) is broken, because all strings are stored as unicode in memory. All the characters are unicode.

However, if you have a non-escaped string in the format "\uXXXX" which you'd like to replace/remove, you can use something like this regex pattern: @"\\u[0-9A-Fa-f]{4}"

Here's a complete example:

string noUnicode = "此检查项己被你忽略,请联系医生。内科";

// If you hard-code the string, you MUST add an `@` before the string, otherwise,
// the "u2028" will get escaped and converted to its corresponding Unicode character.
string s = @"此检查项己被你忽略,请联系医生。\u2028内科";
string ss = Regex.Replace(s, @"\\u[0-9A-Fa-f]{4}", string.Empty);

Debug.Print("s = " + s);
Debug.Print("ss = " + ss);

Debug.Print((ss == noUnicode).ToString());

Here's a fiddle to test, and here's its output:

Fiddle

Note: Since the string is hard-coded, you have to use an @ here to prevent the sub-string "\u2028" from being converted to the corresponding Unicode char. On the other hand, if you get the original string from somewhere else (e.g., read from a text file), the sub-string "\u2028" is already represented as is, there should be no problem, and the above code should work just fine.

So, something like this would work exactly the same:

string s = File.ReadAllText(@"Path\to\a\Unicode\text\file\containing\the\string\'\u2028'");
string ss = Regex.Replace(s, @"\\u[0-9A-Fa-f]{4}", string.Empty);

Upvotes: 2

Related Questions