Reputation: 83
I need a regex to match a string as follows:
- The string must begin with `[`.
- The string must contain a `]`.
- The text between the `[` and the `]` is what I want to capture.
- There must be at least one character between `[` and `]`.
- An optional `;` may come after the `]`. Following the `;` all characters are allowed (although this is sort of irrelevant since I don't care about them).
- After a `]` is present, whitespace (read tabs, spaces - although I can guarantee no `\r\n\f\v` will be present, which is why I used `\s` below) is allowed between the `]` and the `;`.
- If `;` is not present after the `]`, then `]` must be the end of the string.
I ended up with the following regex, which passed all my initial tests: `^\[([^]]+)](?:\s+?;)?`
Speed is key here, so I am looking to improve on the regex that I have in order to shave off a few cycles if possible.
I'm not really sure whether the usage of a lookahead would be useful here.
EDIT
e.g.:
`[some;thing]` - Valid, with capture group `some;thing`
`[something]` - Valid, with capture group `something`
` [something]` - Invalid, does not begin with `[`
`[something] ;ojasodj` - Valid, capture group `something`
`[something] ` - Invalid, space after `]` without a `;` present
`[something];` - Valid, capture group `something`
`[]` - Invalid, must contain at least one character between `[` and `]`
Upvotes: 1
Views: 119
Reputation: 32276
Here's how you can do that with code instead:
    public static bool IsValid(string str, out string capture)
    {
        capture = null;
        // A null string is invalid
        if (str == null) return false;
        // An empty string is invalid
        if (str.Length == 0) return false;
        // A string that does not start with [ is invalid
        if (str[0] != '[') return false;
        int end = str.IndexOf(']');
        // A string that does not have a ] is invalid
        if (end == -1) return false;
        // A string that does not have anything between the [ and ] is invalid
        if (end == 1) return false;
        // If the ] is not the end of the string we need to look for a ;
        if (end != str.Length - 1)
        {
            bool semicolon = false;
            for (int i = end + 1; i < str.Length; i++)
            {
                // ; found so we can stop looking at characters
                if (str[i] == ';')
                {
                    semicolon = true;
                    break;
                }
                // If non-whitespace is between the ] and ; the string is invalid
                if (!char.IsWhiteSpace(str[i])) return false;
            }
            // No ; found so the string is invalid
            if (!semicolon) return false;
        }
        // Capture the string between [ and ]
        capture = str.Substring(1, end - 1);
        return true;
    }
Obviously not as short as a regular expression, but might run faster.
Upvotes: 0
Reputation: 8413
TL;DR: `^\[([^]]+)](?:$|\s*;)`
`^\[([^]]+)]` is already the optimal way to match the first part of your regex, unless you can drop the capturing group. By using the negated character class you avoid the unnecessary backtracking in failing cases that any kind of `.*` or `.*?` pattern would involve.
To fulfill your other rules, you need to match either the end of the string (`$`) or optional whitespace and a semicolon, so that should be `(?:$|\s*;)`. I would put the `$` first, as this is the shorter match (thus quicker success), but this also depends on your data (if the second case is the vast majority, put that first).
Full pattern being `^\[([^]]+)](?:$|\s*;)`
Be aware that `$` might be followed by an optional `\n`, but your test cases didn't look multiline :)
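As a rough sketch, the full pattern could be run against the examples from the question like this in C#. The class name `BracketPatternDemo`, the helper `Capture`, and the use of `RegexOptions.Compiled` are assumptions made for illustration; whether compiling actually helps depends on how often the regex is reused.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketPatternDemo
{
    // Pattern from this answer. RegexOptions.Compiled is an assumption added
    // because speed matters to the asker; measure before relying on it.
    private static readonly Regex Pattern =
        new Regex(@"^\[([^]]+)](?:$|\s*;)", RegexOptions.Compiled);

    // Hypothetical helper: returns the text between [ and ],
    // or null when the input does not satisfy the rules.
    public static string Capture(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // The examples listed in the question.
        string[] samples =
        {
            "[some;thing]", "[something]", " [something]",
            "[something] ;ojasodj", "[something] ", "[something];", "[]"
        };

        foreach (string s in samples)
        {
            string captured = Capture(s);
            string verdict = captured == null ? "invalid" : "valid, capture " + captured;
            Console.WriteLine("\"" + s + "\" -> " + verdict);
        }
    }
}
```

If a trailing `\n` must be rejected as well, `\z` is the stricter anchor in .NET, since it matches only at the very end of the input.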
Upvotes: 2
Reputation: 37367
Try this pattern: `^\[[^\]]+\](?(?=\s*;)\s*;.*|$)`
Explanation:
`^\[[^\]]+\]` will match text enclosed in square brackets at the beginning of the string (`^`), with at least one character other than `]` inside them.
`(?(?=\s*;)\s*;.*|$)` - if what follows the closing square bracket is only whitespace and a semicolon, then match them; otherwise assert that it's the end of the string (`$`).
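A minimal C# sketch of the conditional pattern follows. Note that the pattern as posted has no capturing group, so the parentheses around `[^\]]+` here are an added assumption to expose the bracket content as group 1; the class name `ConditionalPatternDemo` and the helper `TryCapture` are likewise hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalPatternDemo
{
    // The pattern from this answer, with a capturing group added around [^\]]+
    // (an assumption) so the text between [ and ] can be read from group 1.
    private static readonly Regex Pattern =
        new Regex(@"^\[([^\]]+)\](?(?=\s*;)\s*;.*|$)");

    // Hypothetical helper: true when the input is valid; capture is null otherwise.
    public static bool TryCapture(string input, out string capture)
    {
        Match m = Pattern.Match(input);
        capture = m.Success ? m.Groups[1].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        // A few of the examples from the question.
        string[] samples = { "[something] ;ojasodj", "[something] ", "[something];", "[]" };

        foreach (string s in samples)
        {
            string capture;
            bool ok = TryCapture(s, out capture);
            Console.WriteLine("\"" + s + "\" -> " + (ok ? "valid, capture " + capture : "invalid"));
        }
    }
}
```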
Upvotes: 0