Reputation: 55

Replace Unicode characters in a string using C#

string str = "our guests will experience \u001favor in an area";
 bool exists = str.IndexOf("\u001", StringComparison.CurrentCultureIgnoreCase) > -1;

I want to find and replace the character \u001 in a string. I have tried hard to resolve this, but without success.

Please help me resolve this issue. Thanks in advance for your help.

Upvotes: 0

Answers (3)

Vladyslav Kurkotov

Reputation: 505

Use a regular expression:

// Requires: using System.Text.RegularExpressions;
var unicodeRegexp = new Regex(@"\x1f"); // matches the single character U+001F
var testWord = "our guests will experience \u001favor in an area";
var newWord = unicodeRegexp.Replace(testWord, "text for replacement");

In the regex, \x1f matches the same character as \u001f (U+001F). The \x form takes only the two significant hex digits, so the leading zeros are dropped. See https://www.regular-expressions.info/unicode.html#codepoint

Upvotes: 0

Andrew Morton

Reputation: 25047

If we look at the C# language specification, ECMA-334, in section 7.4.2 "Unicode character escape sequences", we find

A Unicode escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers (§7.4.3), character literals (§7.4.5.5), and regular string literals (§7.4.5.6). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).

unicode-escape-sequence:: \u hex-digit hex-digit hex-digit hex-digit
\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

So you have to use four hex digits with the \u.

In your example, it takes "001f" as those four hex digits.

The "\u001" in your example should have given an error in Visual Studio along the lines of "Unrecognized escape sequence."
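As a rough illustration of the point above, the following sketch shows that the literal in the question contains the single character U+001F followed by "avor". The helper name IndexOfUnitSeparator is hypothetical and not part of the original question. The char overload of IndexOf is used because it searches ordinally, which sidesteps culture-sensitive comparison rules that may treat control characters specially.

```csharp
using System;

static class EscapeDemo
{
    // Hypothetical helper: index of the first U+001F, or -1 if it is absent.
    // IndexOf(char) performs an ordinal (culture-insensitive) search.
    public static int IndexOfUnitSeparator(string s) => s.IndexOf('\u001f');

    static void Main()
    {
        // \u consumes exactly four hex digits, so the 'f' is part of the escape.
        string str = "our guests will experience \u001favor in an area";

        int index = IndexOfUnitSeparator(str);
        Console.WriteLine(index);                    // 27
        Console.WriteLine((int)str[index]);          // 31, i.e. 0x1F
        Console.WriteLine(str.Substring(index + 1)); // avor in an area
    }
}
```

The three-digit form "\u001" from the question cannot be searched for at all, because it is not a valid escape sequence and the code does not compile.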

Upvotes: 0

Paweł Dyl

Reputation: 9143

Somewhere deep inside the C# specification, you can find the following:

[Note: The use of the \x hexadecimal-escape-sequence production can be error-prone and hard to read due to the variable number of hexadecimal digits following the \x. For example, in the code:

string good = "\x9Good text";

string bad = "\x9Bad text";

it might appear at first that the leading character is the same (U+0009, a tab character) in both strings. In fact the second string starts with U+9BAD as all three letters in the word "Bad" are valid hexadecimal digits. As a matter of style, it is recommended that \x is avoided in favour of either specific escape sequences (\t in this example) or the fixed-length \u escape sequence. end note]

And also:

unicode-escape-sequence::

\u hex-digit hex-digit hex-digit hex-digit

\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

To put it simply, \u is followed by exactly 4 hex digits (and \U by exactly 8), never 3. Your string is therefore interpreted as "our guests will experience " + "\u001f" + "avor in an area": the f is consumed as the fourth hex digit of the escape.
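Given that interpretation, one possible way to do the replacement without a regex is string.Replace with the full four-digit escape. This is only a sketch: the class and method names are hypothetical, and the empty replacement string is an arbitrary choice for the demonstration.

```csharp
using System;

static class ControlCharReplacer
{
    // Hypothetical helper: replaces every U+001F in the input with the given text.
    // string.Replace(string, string) performs an ordinal comparison.
    public static string ReplaceUnitSeparator(string input, string replacement)
    {
        return input.Replace("\u001f", replacement);
    }

    static void Main()
    {
        string str = "our guests will experience \u001favor in an area";

        // Remove the control character by replacing it with an empty string.
        Console.WriteLine(ReplaceUnitSeparator(str, ""));
        // our guests will experience avor in an area
    }
}
```

Note that the f disappears together with the control character, since it was never a separate letter in the string.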

Upvotes: 2
