bdcoder

Reputation: 3781

RegEx to remove all control / invisible characters EXCEPT CR or LF

I am trying to use regex (.NET) to "sanitize" a Unicode input string. The requirement is to remove all invisible / control characters EXCEPT CR (carriage return) and LF (line feed). In other words, keep all valid printable characters (English and French), plus CR and LF.

I have tried the following (just using the underscore to see what was replaced), but it also removes CR / LF ...

clean_str = Regex.Replace( in_str, "\p{C}+", "_" )

Also tried:

clean_str = Regex.Replace( in_str, "(\p{Cf}|\p{Co}|\p{Cs}|\p{Cn}|[\x00-\x09]|\x0b|\x0c|[\x0e-\x1f]|\x7f)+", "_" )

From http://www.regular-expressions.info/unicode.html ...

\p{C} or \p{Other}: invisible control characters and unused code points.

 ◦\p{Cc} or \p{Control}: an ASCII 0x00–0x1F or Latin-1 0x80–0x9F control character.
 ◦\p{Cf} or \p{Format}: invisible formatting indicator.
 ◦\p{Co} or \p{Private_Use}: any code point reserved for private use.
 ◦\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
 ◦\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
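Going by the list above, CR and LF are themselves \p{Cc} characters, which would explain why "\p{C}+" swallows them. A minimal sketch to check the categories (the class name CategoryDemo is only for illustration; char.GetUnicodeCategory is the standard .NET call):

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    public static void Main()
    {
        // CR and LF are both in the Control (Cc) category, a subset of \p{C}.
        Console.WriteLine(char.GetUnicodeCategory('\r'));     // Control
        Console.WriteLine(char.GetUnicodeCategory('\n'));     // Control

        // U+200E LEFT-TO-RIGHT MARK is an invisible formatting character (Cf).
        Console.WriteLine(char.GetUnicodeCategory('\u200E')); // Format

        // U+00E9 (e with acute accent) is an ordinary letter and not in \p{C}.
        Console.WriteLine(char.GetUnicodeCategory('\u00E9')); // LowercaseLetter
    }
}
```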

Gurus: if you have a better / more efficient way, please post!

Thanks in advance!

Upvotes: 2

Views: 3364

Answers (2)

robert

Reputation: 4867

As an alternative to using RegEx, you could just iterate the string:

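using System.Text; // StringBuilder

public string Clean(string dirty)
{
    // Pre-size the buffer; the result is never longer than the input.
    var clean = new StringBuilder(dirty.Length);

    const char SPACE = ' ';
    const char LF = '\n';
    const char CR = '\r';
    const char DEL = (char)127;

    foreach (var c in dirty)
    {
        switch (c)
        {
            // Keep line breaks.
            case CR or LF:
                clean.Append(c);
                break;

            // Drop the other ASCII control characters (0x00-0x1F) and DEL.
            // The comparison is strict so that ordinary spaces are kept.
            case < SPACE or DEL:
                continue;

            // Everything else is kept, including non-ASCII characters.
            default:
                clean.Append(c);
                break;
        }
    }

    return clean.ToString();
}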
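The loop above only inspects the ASCII range, so C1 controls (0x80-0x9F) and invisible formatting characters pass through untouched. If those need to go as well, one option is to test each character's Unicode category instead. This is a sketch under that assumption, not a drop-in replacement: the class name Sanitizer is made up for illustration, and surrogates are deliberately kept (unlike \p{C}) so that characters outside the BMP survive.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Sanitizer
{
    public static string Clean(string dirty)
    {
        var clean = new StringBuilder(dirty.Length);

        foreach (char c in dirty)
        {
            // CR and LF are Control characters too, so let them through first.
            if (c == '\r' || c == '\n')
            {
                clean.Append(c);
                continue;
            }

            switch (char.GetUnicodeCategory(c))
            {
                case UnicodeCategory.Control:          // Cc
                case UnicodeCategory.Format:           // Cf
                case UnicodeCategory.PrivateUse:       // Co
                case UnicodeCategory.OtherNotAssigned: // Cn
                    break; // drop the character

                default:
                    // Letters, digits, punctuation, spaces, and surrogates
                    // (assumption: surrogate pairs should be preserved).
                    clean.Append(c);
                    break;
            }
        }

        return clean.ToString();
    }
}
```

For example, Sanitizer.Clean("a\u0085b\u200Ec\r\nd e") should give "abc\r\nd e": the C1 control U+0085 and the left-to-right mark U+200E are dropped, while the line break and the space stay.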

Upvotes: 3

Ben Grimm

Reputation: 4371

You can use character class subtraction to exclude CR and LF from the control character class:

clean_str = Regex.Replace( in_str, "[\p{C}-[\r\n]]+", "" )
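For reference, here is roughly how the same pattern might look in C#, as a small self-contained sketch. The class name SubtractionDemo and the sample string are invented for illustration; the pattern is the one above. One caveat worth testing against your own data: \p{C} includes \p{Cs}, and .NET regex works on UTF-16 code units, so I would expect this to strip surrogate halves (and with them any characters outside the BMP) as well.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubtractionDemo
{
    // \p{C} minus CR and LF, using .NET character class subtraction.
    private static readonly Regex Invisible =
        new Regex(@"[\p{C}-[\r\n]]+", RegexOptions.Compiled);

    public static string Clean(string input) => Invisible.Replace(input, "");

    public static void Main()
    {
        // Contains NUL, a left-to-right mark (U+200E) and BEL, plus a CR LF pair.
        string dirty = "caf\u00E9\u0000 au\u200E lait\r\nsuite\u0007";
        string clean = Clean(dirty);

        // The invisible characters are gone; the space and CR LF remain.
        Console.WriteLine(clean == "caf\u00E9 au lait\r\nsuite"); // True
    }
}
```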

Upvotes: 4
