user7150219

C# regex to remove non-printable characters and control characters from text that mixes many different languages and Unicode letters

I would appreciate your help with this, since I do not know which range of characters to use, or whether there is a character class like the [[:cntrl:]] that I found in Ruby.

By non-printable, I mean all characters that are not shown in the output when one prints the input string; those are the ones I want to delete. Please note that I am looking for a C# regex; I do not have a problem with my code.

Upvotes: 38

Views: 40936

Answers (5)

jeb

Reputation: 82420

Instead of removing non-printable characters, you could replace them with \x<hex code>:

// requires: using System.Linq; using System.Text.RegularExpressions;
var formattedText = Regex.Replace(text, @"\p{C}+",
            match => string.Join("",
                match.Value.ToCharArray().
                // X2 formats each code unit as two-digit hexadecimal
                Select(ch => $"\\x{(int)ch:X2}")));

For a string var text = "Bel\x07\n\rOr\t TAB", this produces Bel\x07\x0A\x0DOr\x09 TAB

Upvotes: 0

Amarildo Lena

Reputation: 64

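You can try this (note that it removes every character outside printable ASCII, including non-English letters):

    public static string TrimNonAscii(this string value)
    {
        // Match runs of anything outside the printable ASCII range (space through tilde)
        string pattern = "[^ -~]+";
        Regex reg_exp = new Regex(pattern);
        return reg_exp.Replace(value, "");
    }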

Upvotes: 2

Wiktor Stribiżew

Reputation: 627488

You may remove all control and other non-printable characters with

s = Regex.Replace(s, @"\p{C}+", string.Empty);

The \p{C} Unicode category class matches all control characters, even those outside the ASCII table because in .NET, Unicode category classes are Unicode-aware by default.

Breaking it down into subcategories

  • To only match basic control characters you may use \p{Cc}+, which covers the 65 chars in the Other, Control Unicode category. It is equal to a [\u0000-\u001F\u007F-\u009F]+ regex.
  • To only match 161 other format chars including the well-known soft hyphen (\u00AD), zero-width space (\u200B), zero-width non-joiner (\u200C), zero-width joiner (\u200D), left-to-right mark (\u200E) and right-to-left mark (\u200F) use \p{Cf}+. The equivalent including astral plane code points is a (?:[\xAD\u0600-\u0605\u061C\u06DD\u070F\u08E2\u180E\u200B-\u200F\u202A-\u202E\u2060-\u2064\u2066-\u206F\uFEFF\uFFF9-\uFFFB]|\uD804[\uDCBD\uDCCD]|\uD80D[\uDC30-\uDC38]|\uD82F[\uDCA0-\uDCA3]|\uD834[\uDD73-\uDD7A]|\uDB40[\uDC01\uDC20-\uDC7F])+ regex.
  • To match 137,468 Other, Private Use code points you may use \p{Co}+, or its equivalent including astral plane code points, (?:[\uE000-\uF8FF]|[\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uDBBF\uDBFF][\uDC00-\uDFFD])+.
  • To match 2,048 Other, Surrogate code points, you may use \p{Cs}+, or [\uD800-\uDFFF]+ regex. Because .NET regexes work on UTF-16 code units, this also matches the surrogate pairs that encode astral plane characters such as most emojis.
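
As a rough illustration of the subcategories above, here is a minimal sketch that strips one category at a time from a small sample string. The class name CategoryDemo, the helper Strip, and the sample text are made up for this example; only Regex.Replace and the \p{...} classes come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // Sample (hypothetical): BEL is Cc, soft hyphen and zero-width space are Cf,
    // U+E000 is a private-use code point (Co). The letters a..e are ordinary text.
    public const string Sample = "a\u0007b\u00ADc\u200Bd\uE000e";

    // Hypothetical helper: removes everything the given category pattern matches.
    public static string Strip(string s, string categoryPattern)
    {
        return Regex.Replace(s, categoryPattern, string.Empty);
    }
}
```

Calling CategoryDemo.Strip(CategoryDemo.Sample, @"\p{Cc}+") should drop only the BEL character, @"\p{Cf}+" only the soft hyphen and the zero-width space, @"\p{Co}+" only U+E000, and @"\p{C}+" all four, leaving "abcde".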

Upvotes: 106

Nerdroid

Reputation: 13996

To remove all control and other non-printable characters

Regex.Replace(s, @"\p{C}+", String.Empty);

To remove the control characters only (if you don't want to remove the emojis 😎)

Regex.Replace(s, @"\p{Cc}+", String.Empty);
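
The emoji remark follows from .NET regexes seeing an emoji as two UTF-16 surrogate code units, which fall under \p{Cs} and therefore under \p{C}. If you also want to drop invisible format characters (such as the zero-width space) while keeping emojis, one possible middle ground is to list the subcategories explicitly and leave \p{Cs} out. This is only a sketch: the class and method names are invented for the example, and unassigned code points (\p{Cn}) are deliberately left alone.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiSafeCleaner
{
    // Removes every "Other" character, including the surrogate halves of emojis.
    public static string RemoveAllOther(string s)
    {
        return Regex.Replace(s, @"\p{C}+", string.Empty);
    }

    // Assumption for this sketch: remove control, format and private-use
    // characters, but keep surrogates so that emojis survive.
    public static string RemoveButKeepEmoji(string s)
    {
        return Regex.Replace(s, @"[\p{Cc}\p{Cf}\p{Co}]+", string.Empty);
    }
}
```

For the input "Hi 😎\t\u200B!", RemoveAllOther should return "Hi !" while RemoveButKeepEmoji should return "Hi 😎!". Note that the second variant also keeps lone (unpaired) surrogates, if the input contains any.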

Upvotes: 2

Yanga

Reputation: 3012

You can try:

string s = "Täkörgåsmrgås";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);


Updated answer after comments:

Documentation about non-printable character: https://en.wikipedia.org/wiki/Control_character

Char.IsControl Method:

https://msdn.microsoft.com/en-us/library/system.char.iscontrol.aspx

Maybe you can try:

string input = "Line1\u0007 Line2"; // this is your input string (sample value)
string output = new string(input.Where(c => !char.IsControl(c)).ToArray()); // requires using System.Linq;
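
Keep in mind that char.IsControl only covers the Control (Cc) category, so invisible format characters such as the zero-width space are left in place. If those should go as well, a regex-free variant can check char.GetUnicodeCategory instead. The following is a sketch under that assumption; the class and method names are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class NonPrintableFilter
{
    // Keeps every char whose category is neither Control (Cc) nor Format (Cf).
    // Surrogates are kept, so emojis and other astral characters survive.
    public static string RemoveControlAndFormat(string input)
    {
        return new string(input.Where(c =>
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            return category != UnicodeCategory.Control
                && category != UnicodeCategory.Format;
        }).ToArray());
    }
}
```

For example, NonPrintableFilter.RemoveControlAndFormat("a\u0007b\u200Bc") should give "abc".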

Upvotes: 6
