Reputation: 2297
I'm working on a calculator that takes string expressions and evaluates them. I have a function that searches the expression for math functions using Regex, retrieves the arguments, looks up the function name, and evaluates it. The problem is that I can only do this if I know in advance how many arguments there will be; I can't get the Regex right. And if I just split the contents between the ( and ) characters on the , character, then I can't have other function calls inside an argument.
Here is the function matching pattern: \b([a-z][a-z0-9_]*)\((..*)\)\b
It only works with one argument. How can I create a group for every argument, excluding the ones inside nested functions? For example, it would match func1(2 * 7, func2(3, 5)) and create capture groups for 2 * 7 and func2(3, 5).
Here is the function I'm using to evaluate the expression:
/// <summary>
/// Attempts to evaluate and store the result of the given mathematical expression.
/// </summary>
public static bool Evaluate(string expr, ref double result)
{
    expr = expr.ToLower();
    try
    {
        // Matches for result identifiers, constants/variables objects, and functions.
        MatchCollection results = Calculator.PatternResult.Matches(expr);
        MatchCollection objs = Calculator.PatternObjId.Matches(expr);
        MatchCollection funcs = Calculator.PatternFunc.Matches(expr);
        // Parse the expression for functions.
        foreach (Match match in funcs)
        {
            System.Windows.Forms.MessageBox.Show("Function found. - " + match.Groups[1].Value + "(" + match.Groups[2].Value + ")");
            int argCount = 0;
            List<string> args = new List<string>();
            List<double> argVals = new List<double>();
            string funcName = match.Groups[1].Value;
            // Ensure the function exists.
            if (_Functions.ContainsKey(funcName)) {
                argCount = _Functions[funcName].ArgCount;
            } else {
                Error("The function '"+funcName+"' does not exist.");
                return false;
            }
            // Create the pattern for matching arguments.
            string argPattTmp = funcName + "\\(\\s*";
            for (int i = 0; i < argCount; ++i)
                argPattTmp += "(..*)" + ((i == argCount - 1) ? ",":"") + "\\s*";
            argPattTmp += "\\)";
            // Get all of the argument strings.
            Regex argPatt = new Regex(argPattTmp);
            // Evaluate and store all argument values.
            foreach (Group argMatch in argPatt.Matches(match.Value.Trim())[0].Groups)
            {
                string arg = argMatch.Value.Trim();
                System.Windows.Forms.MessageBox.Show(arg);
                if (arg.Length > 0)
                {
                    double argVal = 0;
                    // Check if the argument is a double or expression.
                    try {
                        argVal = Convert.ToDouble(arg);
                    } catch {
                        // Attempt to evaluate the arguments expression.
                        System.Windows.Forms.MessageBox.Show("Argument is an expression: " + arg);
                        if (!Evaluate(arg, ref argVal)) {
                            Error("Invalid arguments were passed to the function '" + funcName + "'.");
                            return false;
                        }
                    }
                    // Store the value of the argument.
                    System.Windows.Forms.MessageBox.Show("ArgVal = " + argVal.ToString());
                    argVals.Add(argVal);
                }
                else
                {
                    Error("Invalid arguments were passed to the function '" + funcName + "'.");
                    return false;
                }
            }
            // Parse the function and replace with the result.
            double funcResult = RunFunction(funcName, argVals.ToArray());
            expr = new Regex("\\b"+match.Value+"\\b").Replace(expr, funcResult.ToString());
        }
        // Final evaluation.
        result = Program.Scripting.Eval(expr);
    }
    catch (Exception ex)
    {
        Error(ex.Message);
        return false;
    }
    return true;
}
////////////////////////////////// ---- PATTERNS ---- \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
/// <summary>
/// The pattern used for function calls.
/// </summary>
public static Regex PatternFunc = new Regex(@"([a-z][a-z0-9_]*)\((..*)\)");
As you can see, there is a pretty bad attempt at building a Regex to match the arguments. It doesn't work.
All I am trying to do is extract 2 * 7 and func2(3, 5) from the expression func1(2 * 7, func2(3, 5)), but it must work for functions with different argument counts as well. A solution that doesn't use Regex is also fine.
Upvotes: 19
Views: 48048
Reputation: 51
This regex does what you want:
^(?<FunctionName>\w+)\((?>(?(param),)(?<param>(?>(?>[^\(\),"]|(?<p>\()|(?<-p>\))|(?(p)[^\(\)]|(?!))|(?(g)(?:""|[^"]|(?<-g>"))|(?!))|(?<g>")))*))+\)$
Don't forget to escape backslashes and double quotes when pasting it in your code.
It correctly matches arguments in double quotes, nested function calls, and numbers, as in this example:
f1(123,"df""j"" , dhf",abc12,func2(),func(123,a>2))
The param stack will contain:
123
"df""j"" , dhf"
abc12
func2()
func(123,a>2)
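To read the param stack from C#, one option is to walk the Captures collection of the named group. The following is only a sketch: the class name ParamCaptures and the method GetArguments are made up for illustration, and the pattern is the one above rewritten as a verbatim string, where backslashes stay as they are and each double quote is doubled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the answer: wraps the regex above.
public static class ParamCaptures
{
    // Same pattern as above in verbatim-string form (each " written as "").
    private static readonly Regex CallPattern = new Regex(
        @"^(?<FunctionName>\w+)\((?>(?(param),)(?<param>(?>(?>[^\(\),""]|(?<p>\()|(?<-p>\))|(?(p)[^\(\)]|(?!))|(?(g)(?:""""|[^""]|(?<-g>""))|(?!))|(?<g>"")))*))+\)$");

    // Returns the trimmed top-level arguments, or an empty list if the
    // input does not match the pattern at all.
    public static List<string> GetArguments(string call)
    {
        var args = new List<string>();
        Match m = CallPattern.Match(call);
        if (!m.Success)
            return args;

        // Every repetition of the "param" group leaves one entry in Captures.
        foreach (Capture c in m.Groups["param"].Captures)
            args.Add(c.Value.Trim());
        return args;
    }
}
```

Calling ParamCaptures.GetArguments("func1(2 * 7, func2(3, 5))") is meant to return the two strings from the question. Note that Groups["param"].Value alone would only give the last capture, which is why the sketch iterates over Captures.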
Upvotes: 5
Reputation: 17272
There is both a simple solution and a more advanced solution (added after edit) to handle more complex functions.
To handle the example you posted, I suggest doing this in two steps. The first step is to extract the parameter list as a single string (the regexes are explained at the end):
\b[^()]+\((.*)\)$
Now, to parse the parameters.
Simple solution
Extract the parameters using:
([^,]+\(.+?\))|([^,]+)
Here are some C# code examples (all asserts pass):
string extractFuncRegex = @"\b[^()]+\((.*)\)$";
string extractArgsRegex = @"([^,]+\(.+?\))|([^,]+)";
//Your test string
string test = @"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
Explanation of the regexes. The regex that extracts the arguments as a single string:
\b[^()]+\((.*)\)$
where:
[^()]+ matches characters that are not an opening or closing bracket.
\((.*)\) captures everything inside the brackets.
The args extraction:
([^,]+\(.+?\))|([^,]+)
where:
([^,]+\(.+?\)) matches characters that are not commas, followed by characters in brackets. This picks up the func arguments. Note the +? so that the match is lazy and stops at the first ) it meets.
|([^,]+) If the previous alternative does not match, then match consecutive chars that are not commas. These matches go into groups.
More advanced solution
Now, there are some obvious limitations with that approach. For example, it stops at the first closing bracket, so it doesn't handle nested functions very well. For a more comprehensive solution (if you require it), we need to use balancing group definitions (as I mentioned before this edit). For our purposes, balancing group definitions allow us to keep track of the instances of the open brackets and subtract the closing bracket instances. In essence, opening and closing brackets cancel each other out in the balancing part of the search. That is, the match continues until the brackets balance and the final closing bracket is found.
So, the regex to extract the parameters is now (the function extraction can stay the same):
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
Here are some test cases to show it in action:
string extractFuncRegex = @"\b[^()]+\((.*)\)$";
string extractArgsRegex = @"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+";
//Your test string
string test = @"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
//A more advanced test string
test = @"someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)";
match = Regex.Match( test, extractFuncRegex );
innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2" );
matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "a" );
Assert.AreEqual( matches[1].Value.Trim(), "b" );
Assert.AreEqual( matches[2].Value.Trim(), "func1(a,b+c)" );
Assert.AreEqual( matches[3].Value.Trim(), "func2(a*b,func3(a+b,c))" );
Assert.AreEqual( matches[4].Value.Trim(), "func4(e)+func5(f)" );
Assert.AreEqual( matches[5].Value.Trim(), "func6(func7(g,h)+func8(i,(a)=>a+2))" );
Assert.AreEqual( matches[6].Value.Trim(), "g+2" );
Note in particular how complex an input the method can now handle:
someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)
So, looking at the regex again:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
In summary, it starts out with characters that are not commas or brackets. Then, if there are brackets in the argument, it matches and subtracts the brackets until they balance. It then tries to repeat that match in case there are other functions in the argument. It then goes on to the next argument (after the comma). In detail:
[^,()]+ matches anything that is not ',', '(' or ')'.
?: means non-capturing group, i.e. do not store matches within brackets in a group.
\( means start at an open bracket.
?> means atomic grouping. Essentially, this means it does not remember backtracking positions. This also helps to improve performance because there are fewer stepbacks to try different combinations.
[^()]+| means anything but an opening or closing bracket. This is followed by | (or).
\((?<open>)| This is the good stuff and says match '(' or
\)(?<-open>) This is the better stuff that says match a ')' and balance out the '('. This means that this part of the match (everything after the first bracket) will continue until all the internal brackets match. Without the balancing expressions, the match would finish on the first closing bracket. The crux is that the engine does not match this ')' against the final ')'; instead it is subtracted from the matching '('. When there are no further outstanding '(', the -open fails so the final ')' can be matched.
One final embellishment:
If you add (?(open)(?!)) to the regex:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*(?(open)(?!))\)))*)+
The (?!) will always fail if open has captured something (that hasn't been subtracted), i.e. it will always fail if there is an opening bracket without a closing bracket. This is a useful way to test whether the balancing has failed.
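To see the (?(open)(?!)) check in isolation, here is a minimal sketch of a whole-string bracket balance test. It is not the argument-extraction regex above, just the same push/pop/check idea reduced to its core; the class name BracketCheck and the method IsBalanced are invented for this example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class BracketCheck
{
    // [^()]          : any character that is not a bracket
    // (?<open>\()    : '(' pushes a capture onto the "open" stack
    // (?<-open>\))   : ')' pops one; this alternative fails if the stack is empty
    // (?(open)(?!))  : at the end, fail if any '(' is still outstanding
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }
}
```

With this sketch, BracketCheck.IsBalanced("func1(2 * 7, func2(3, 5))") succeeds, while dropping the last ) or adding an extra one makes the match fail. Running such a check before extracting arguments is one way to report a bracket error early.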
Hope that helps.
Upvotes: 37
Reputation: 10136
There are some relatively new, language-specific extensions to regex that make it possible to match context-free languages with "regex", but you will find more resources and more help if you use the tools more commonly used for this kind of task:
It'd be better to use a parser generator like ANTLR, LEX+YACC, FLEX+BISON, or any other commonly used parser generator. Most of them come with complete examples on how to build simple calculators that support grouping and function calls.
Upvotes: 0
Reputation: 16584
I'm sorry to burst the RegEx bubble, but this is one of those things that you just can't do effectively with regular expressions alone.
What you're implementing is basically an Operator-Precedence Parser with support for sub-expressions and argument lists. The statement is processed as a stream of tokens - possibly using regular expressions - with sub-expressions processed as high-priority operations.
With the right code you can do this as an iteration over the full token stream, but recursive parsers are common too. Either way you have to be able to effectively push state and restart parsing at each of the sub-expression entry points, i.e. a "(", "," or "<function_name>(" token, and push the result up the parser chain at the sub-expression exit points, i.e. a ")" or "," token.
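As a rough illustration of the recursive variant, here is a small recursive-descent evaluator. It is a sketch, not a full operator-precedence parser: it reads characters directly instead of building a token stream, supports only + - * /, unary minus, parentheses and function calls, and the names MiniParser and Functions (with its two sample entries max and sqrt) are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical sketch; class, method and function names are illustrative.
public sealed class MiniParser
{
    // Sample function table. Argument counts are not validated here.
    private static readonly Dictionary<string, Func<double[], double>> Functions =
        new Dictionary<string, Func<double[], double>>
        {
            { "max", a => Math.Max(a[0], a[1]) },
            { "sqrt", a => Math.Sqrt(a[0]) },
        };

    private readonly string _text;
    private int _pos;

    private MiniParser(string text) { _text = text; }

    public static double Evaluate(string text)
    {
        var p = new MiniParser(text);
        double value = p.ParseExpression();
        p.SkipSpaces();
        if (p._pos != text.Length)
            throw new FormatException("Unexpected input at position " + p._pos);
        return value;
    }

    private void SkipSpaces()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }

    // Consumes c (after skipping spaces) if it is the next character.
    private bool Accept(char c)
    {
        SkipSpaces();
        if (_pos < _text.Length && _text[_pos] == c) { _pos++; return true; }
        return false;
    }

    private void Expect(char c)
    {
        if (!Accept(c))
            throw new FormatException("Expected '" + c + "' at position " + _pos);
    }

    // expression := term (('+' | '-') term)*
    private double ParseExpression()
    {
        double value = ParseTerm();
        while (true)
        {
            if (Accept('+')) value += ParseTerm();
            else if (Accept('-')) value -= ParseTerm();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double ParseTerm()
    {
        double value = ParseFactor();
        while (true)
        {
            if (Accept('*')) value *= ParseFactor();
            else if (Accept('/')) value /= ParseFactor();
            else return value;
        }
    }

    // factor := '-' factor | '(' expression ')' | name '(' args ')' | number
    private double ParseFactor()
    {
        if (Accept('-')) return -ParseFactor();
        if (Accept('('))
        {
            double inner = ParseExpression();   // sub-expression entry point
            Expect(')');                        // sub-expression exit point
            return inner;
        }
        int start = _pos;                       // Accept already skipped spaces
        if (_pos < _text.Length && char.IsLetter(_text[_pos]))
        {
            while (_pos < _text.Length && (char.IsLetterOrDigit(_text[_pos]) || _text[_pos] == '_')) _pos++;
            string name = _text.Substring(start, _pos - start);
            Func<double[], double> func;
            if (!Functions.TryGetValue(name, out func))
                throw new FormatException("Unknown function '" + name + "'.");
            Expect('(');
            var args = new List<double>();
            if (!Accept(')'))
            {
                // Each argument is a full expression, so nesting comes for free.
                do { args.Add(ParseExpression()); } while (Accept(','));
                Expect(')');
            }
            return func(args.ToArray());
        }
        while (_pos < _text.Length && (char.IsDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
        if (_pos == start)
            throw new FormatException("Expected a number at position " + _pos);
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }
}
```

For example, MiniParser.Evaluate("max(2 * 7, sqrt(9 + 16))") evaluates both arguments recursively before calling max. Because every argument goes through ParseExpression, the number of arguments and the nesting depth never need to be known in advance.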
Upvotes: 4
Reputation: 4001
Regular expressions aren't going to get you completely out of trouble with this...
Since you have nested parentheses, you need to modify your code to count ( against ). When you encounter a (, take note of the position and then look ahead, incrementing a counter for each extra ( you find and decrementing it for each ) you find. When the counter is 0 and you find a ), that is the end of your function parameter block, and you can then parse the text between the parentheses. You can also split the text on , when the counter is 0 to get the function parameters.
If you reach the end of the string without finding that closing ), you have a "(" without ")" error.
You then take the text block(s) between the opening and closing parentheses and any commas, and repeat the above for each parameter.
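The counting approach can be sketched as follows. This is one possible reading of the steps above, with an invented class name (ArgumentSplitter) and method name (SplitArguments); it ignores quoted strings, so commas or brackets inside string literals would need extra handling.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class ArgumentSplitter
{
    // Returns the top-level arguments of the call whose '(' is at openIndex.
    public static List<string> SplitArguments(string expr, int openIndex)
    {
        if (openIndex < 0 || openIndex >= expr.Length || expr[openIndex] != '(')
            throw new ArgumentException("openIndex must point at a '('.");

        var args = new List<string>();
        int depth = 0;               // number of extra '(' currently open
        int start = openIndex + 1;   // start of the argument being read

        for (int i = openIndex + 1; i < expr.Length; i++)
        {
            char c = expr[i];
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                if (depth == 0)
                {
                    // Closing bracket of the call itself: emit the last argument.
                    string last = expr.Substring(start, i - start).Trim();
                    if (last.Length > 0 || args.Count > 0)
                        args.Add(last);
                    return args;
                }
                depth--;
            }
            else if (c == ',' && depth == 0)
            {
                // A comma at depth 0 separates two arguments of this call.
                args.Add(expr.Substring(start, i - start).Trim());
                start = i + 1;
            }
        }
        throw new FormatException("Found a \"(\" without a matching \")\".");
    }
}
```

For the expression from the question, SplitArguments(expr, expr.IndexOf('(')) yields 2 * 7 and func2(3, 5). Each returned string can then be checked for a further function call and processed the same way.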
Upvotes: 0