Reputation: 2436
I'm writing a function-parsing engine which uses regular expressions to separate the individual terms (defined as a constant or a variable followed (optionally) by an operator). It's working great, except when I have grouped terms within other grouped terms. Here's the code I'm using:
//This matches an opening delimiter
Regex openers = new Regex("[\\[\\{\\(]");
//This matches a closing delimiter
Regex closers = new Regex("[\\]\\}\\)]");
//This matches the name of a variable (\w+) or a constant numeric value (\d+(\.\d+)?)
Regex VariableOrConstant = new Regex("((\\d+(\\.\\d+)?)|\\w+)" + FunctionTerm.opRegex + "?");
//This matches the binary operators +, *, -, or /
Regex ops = new Regex("[\\*\\+\\-/]");
//This compound Regex finds a single variable or constant term (including a proceeding operator,
//if any) OR a group containing multiple terms (and their proceeding operators, if any)
//and a proceeding operator, if any.
//Matches that match this second pattern need to be added to the function as sub-functions,
//not as individual terms, to ensure the correct evalutation order with parentheses.
Regex splitter = new Regex(
openers +
"(" + VariableOrConstant + ")+" + closers + ops + "?" +
"|" +
"(" + VariableOrConstant + ")" + ops + "?");
When "splitter" is matched against the string "4/(2*X*[2+1])", the matches' values are "4/", "2*", "X*", "2+", and "1", completely ignoring all of the delimiting parentheses and braces. I believe this is because the second half of the "splitter" Regex (the part after the "|") is being matched and overriding the other part of the pattern. This is bad- I want grouped expressions to take precedence over single terms. Does anyone know how I can do this? I looked into using positive/negative lookaheads and lookbehinds, but I'm honestly not sure how to use those, or what they're even for, for that matter, and I can't find any relevant examples... Thanks in advance.
Upvotes: 8
Views: 8499
Reputation: 7705
try using difrent quantifiers
greedy:
* + ?
possessive:
*+ ++ ?+
lazy:
*? +? ??
also maybe non-capturing group:
(?:your expr here)
try try try! practice make perfect! :)
Upvotes: 3
Reputation: 75232
You didn't show us how you're applying the regex, so here's a demo I whipped up:
private static void ParseIt(string subject)
{
Console.WriteLine("subject : {0}\n", subject);
Regex openers = new Regex(@"[[{(]");
Regex closers = new Regex(@"[]})]");
Regex ops = new Regex(@"[*+/-]");
Regex VariableOrConstant = new Regex(@"((\d+(\.\d+)?)|\w+)" + ops + "?");
Regex splitter = new Regex(
openers + @"(?<FIRST>" + VariableOrConstant + @")+" + closers + ops + @"?" +
@"|" +
@"(?<SECOND>" + VariableOrConstant + @")" + ops + @"?",
RegexOptions.ExplicitCapture
);
foreach (Match m in splitter.Matches(subject))
{
foreach (string s in splitter.GetGroupNames())
{
Console.WriteLine("group {0,-8}: {1}", s, m.Groups[s]);
}
Console.WriteLine();
}
}
output:
subject : 4/(2*X*[2+1])
group 0 : 4/
group FIRST :
group SECOND : 4/
group 0 : 2*
group FIRST :
group SECOND : 2*
group 0 : X*
group FIRST :
group SECOND : X*
group 0 : [2+1]
group FIRST : 1
group SECOND :
As you can see, the term [2+1]
is matched by the first part of the regex, as you intended. It can't do anything with the (
, though, because the next bracketing character after that is another "opener" ([
), and it's looking for a "closer".
You could use .NET's "balanced matching" feature to allow for grouped terms enclosed in other groups, but it's not worth the effort. Regexes are not designed for parsing--in fact, parsing and regex matching are fundamentally different kinds of operation. And this is a good example of the difference: a regex actively seeks out matches, skipping over anything it can't use (like the open-parenthesis in your example), but a parser has to examine every character (even if it's just to decide to ignore it).
About the demo: I tried to make the minimum functional changes necessary to get your code to work (which is why I didn't correct the error of putting the +
outside the capturing group), but I also made several surface changes, and those represent active recommendations. To wit:
@"..."
) when creating regexes in C# (I think the reason is obvious).ExplicitCapture
option saves you having to clutter up the regex with (?:...)
wherever you need a non-capturing group.Finally, that whole scheme of building a large regex from a bunch of smaller regexes has very limited usefulness IMO. It's very difficult to keep track of the interactions between the parts, like which part's inside which group. Another advantage of C#'s verbatim strings is that they're multiline, so you can take advantage of free-spacing mode (a.k.a. /x
or COMMENTS mode):
Regex r = new Regex(@"
(?<GROUPED>
[[{(] # opening bracket
( # group containing:
((\d+(\.\d+)?)|\w+) # number or variable
[*+/-]? # and proceeding operator
)+ # ...one or more times
[]})] # closing bracket
[*+/-]? # and proceeding operator
)
|
(?<UNGROUPED>
((\d+(\.\d+)?)|\w+) # number or variable
[*+/-]? # and proceeding operator
)
",
RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace
);
This is not intended as a solution to your problem; as I said, that's not a job for regexes. This is just a demonstration of some useful regex techniques.
Upvotes: 5