JonU

Reputation: 83

Optimise the following regex

I need a regex to match a string as follows:

I ended up with the following regex, which passed all my initial tests: ^\[([^]]+)](?:\s+?;)?

Speed is key here, so I am looking to improve on the regex that I have in order to shave off a few cycles if possible.

I'm not really sure whether the usage of a lookahead would be useful here.

EDIT

eg:

[some;thing] - Valid, with capture group some;thing

[something] - Valid, with capture group something

[something] - Invalid, does not begin with [

[something] ;ojasodj - Valid, capture group something

[something] - Invalid, space after ] without a ; present

[something]; - Valid, capture group something

[] - Invalid, must contain at least one character between [ and ]

Upvotes: 1

Views: 119

Answers (3)

juharr

Reputation: 32276

Here's how you can do that with code instead:

public static bool IsValid(string str, out string capture)
{
    capture = null;

    // A null string is invalid
    if(str == null) return false;

    // An empty string is invalid
    if(str.Length == 0) return false;

    // A string that does not start with [ is invalid
    if(str[0] != '[') return false;
    int end = str.IndexOf(']');

    // A string that does not have a ] is invalid
    if(end == -1) return false;

    // A string that does not have anything between the [ and ] is invalid
    if(end == 1) return false;

    // if the ] is not the end of the string we need to look for a ;.
    if(end != str.Length -1)
    {
        bool semicolon = false;
        for(int i = end + 1; i < str.Length; i++)
        {
            // ; found so we can stop looking at characters.
            if(str[i] == ';') 
            {
                semicolon = true;
                break;
            }

            // If non-whitespace is between the ] and ; the string is invalid
            if(!char.IsWhiteSpace(str[i])) return false;
        }

        // No ; found so the string is invalid
        if(!semicolon) return false;
    }

    // Capture the string between [ and ]
    capture = str.Substring(1,end - 1);
    return true;
}

Obviously this is not as short as a regular expression, but it might run faster.
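Whether the hand-written check really beats a regex depends on the input and the runtime, so it is worth measuring. Below is a rough timing sketch, not a rigorous benchmark. The class name ValidatorBenchmark, the condensed ManualIsValid (same rules as above, without the out parameter), and the regex used for comparison are all assumptions made for illustration. Stopwatch timings are noisy; a dedicated tool such as BenchmarkDotNet would give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ValidatorBenchmark
{
    // Assumed regex form of the same rules (not part of the answer above).
    static readonly Regex Pattern =
        new Regex(@"^\[([^]]+)](?:\z|\s*;)", RegexOptions.Compiled);

    public static bool RegexIsValid(string s)
    {
        return s != null && Pattern.IsMatch(s);
    }

    // Condensed version of the hand-written check, without the capture.
    public static bool ManualIsValid(string s)
    {
        if (string.IsNullOrEmpty(s) || s[0] != '[') return false;
        int end = s.IndexOf(']');
        if (end <= 1) return false;            // no ] found, or nothing between [ and ]
        if (end == s.Length - 1) return true;  // ] is the last character
        int i = end + 1;
        while (i < s.Length && char.IsWhiteSpace(s[i])) i++;
        return i < s.Length && s[i] == ';';    // only whitespace may precede the ;
    }

    // Runs the check over all inputs 'rounds' times and returns elapsed milliseconds.
    static long Time(Func<string, bool> check, string[] inputs, int rounds)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int r = 0; r < rounds; r++)
            foreach (string s in inputs)
                check(s);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "[some;thing]", "[something]", "[something] ;ojasodj",
            "[something] ", "[something];", "[]"
        };

        // Warm-up so JIT compilation is not included in the timed runs.
        Time(RegexIsValid, inputs, 1000);
        Time(ManualIsValid, inputs, 1000);

        Console.WriteLine("regex:  {0} ms", Time(RegexIsValid, inputs, 1000000));
        Console.WriteLine("manual: {0} ms", Time(ManualIsValid, inputs, 1000000));
    }
}
```

The sample inputs are only the examples from the question; run it with data that resembles your real input before drawing conclusions.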

Upvotes: 0

Sebastian Proske

Reputation: 8413

TL;DR: ^\[([^]]+)](?:$|\s*;)

^\[([^]]+)] is already the optimal way to match the first part of your regex, unless you can drop the capturing group. By using the negated character class you avoid any kind of unnecessary backtracking in failing cases that would be involved in any kind of .* or .*? pattern.

To fulfill your other rules, you need to match either the end of the string ($) or optional whitespace followed by a semicolon, so that should be (?:$|\s*;). I would put the $ first, as this is the shorter match (thus quicker success), but this also depends on your data (if the second case is the vast majority, put that first).

Full pattern being ^\[([^]]+)](?:$|\s*;)

Be aware that $ also matches just before a trailing \n at the end of the string, but your test cases didn't look multiline :)
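As a quick check of the pattern against the examples in the question, here is a minimal C# sketch. It assumes .NET's System.Text.RegularExpressions; the class name BracketPattern, the RegexOptions.Compiled flag, and the \z variant are illustrative additions rather than part of the answer. The samples with a leading or trailing space are my reading of the invalid cases in the question. The last sample illustrates the trailing \n caveat: $ accepts it, while \z does not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketPattern
{
    // Pattern from the answer; RegexOptions.Compiled is an assumption for hot paths.
    public static readonly Regex WithDollar =
        new Regex(@"^\[([^]]+)](?:$|\s*;)", RegexOptions.Compiled);

    // Hypothetical stricter variant: \z matches only at the very end of the input.
    public static readonly Regex WithZ =
        new Regex(@"^\[([^]]+)](?:\z|\s*;)", RegexOptions.Compiled);

    public static void Main()
    {
        string[] samples =
        {
            "[some;thing]", "[something]", " [something]", "[something] ;ojasodj",
            "[something] ", "[something];", "[]", "[something]\n"
        };

        foreach (string s in samples)
        {
            Match m = WithDollar.Match(s);
            string capture = m.Success ? m.Groups[1].Value : "(no match)";
            // Show the newline as \n so every sample prints on a single line.
            string shown = s.Replace("\n", "\\n");
            Console.WriteLine("\"{0}\" -> {1}; \\z variant matches: {2}",
                shown, capture, WithZ.IsMatch(s));
        }
    }
}
```

If a trailing newline must be rejected, swapping $ for \z is one way to do it; otherwise the pattern from the answer can be used unchanged.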

Upvotes: 2

Michał Turczyn

Reputation: 37367

Try this pattern: ^\[[^\]]+\](?(?=\s*;)\s*;.*|$)

Explanation:

^\[[^\]]+\] will match text enclosed in square brackets at the beginning of the string (^) (at least one character other than ] inside them).

(?(?=\s*;)\s*;.*|$) - if the closing square bracket is followed by optional whitespace and a semicolon, match them and the rest of the line; otherwise assert that it is the end of the string ($).
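To see the conditional construct in action, here is a small C# sketch. Note that the pattern as written has no capturing group, so the version below wraps [^\]]+ in parentheses; that change, the class name ConditionalPattern, and the helper TryMatch are my assumptions, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalPattern
{
    // Pattern from the answer with one change: parentheses around [^\]]+
    // (an assumption, added so the bracket content is available as group 1).
    public static readonly Regex Conditional =
        new Regex(@"^\[([^\]]+)\](?(?=\s*;)\s*;.*|$)");

    // Returns true and sets 'capture' to the bracket content when the input matches.
    public static bool TryMatch(string input, out string capture)
    {
        Match m = Conditional.Match(input);
        capture = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        string[] samples = { "[something]", "[something] ;ojasodj", "[something] ", "[]" };
        foreach (string s in samples)
        {
            string capture;
            bool ok = TryMatch(s, out capture);
            Console.WriteLine("\"{0}\" -> {1}", s, ok ? "valid, capture: " + capture : "invalid");
        }
    }
}
```

The trailing .* only extends the overall match to the end of the line; if just the capture is needed, it could probably be dropped.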


Upvotes: 0
