Reputation: 876
I'm parsing HTML code in a C# project.
Assuming that we have this string:
<a href="javascript:func('data1','data2'...)">...</a>
Or, after the necessary .Substring() calls, this one:
func('data1','data2'...)
What would be the best Regex pattern to retrieve func()'s parameters without relying on the delimiter characters (' and ,), since they can sometimes be part of a parameter's string?
Upvotes: 2
Views: 2069
Reputation: 34429
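Try this:
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "<a href=\"javascript:func('data1','data2'...)\">...</a>";

// Step 1: capture everything between the parentheses of the call.
// Note: [^\)]+ stops at the first ')', so a ')' inside a parameter cuts the match short.
string pattern1 = @"\w+\((?'parameters'[^\)]+)\)";
Regex expr1 = new Regex(pattern1);
Match match1 = expr1.Match(input);
string parameters = match1.Groups["parameters"].Value;

// Step 2: treat each run of word characters as one parameter.
// Note: quotes are dropped, and spaces or punctuation inside a parameter split it.
string pattern2 = @"\w+";
Regex expr2 = new Regex(pattern2);
MatchCollection matches = expr2.Matches(parameters);
List<string> results = new List<string>();
foreach (Match match in matches)
{
    results.Add(match.Value);
}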
Upvotes: -2
Reputation: 6511
You should not use regex to parse programming language code, because such code is not a regular language: its constructs can nest to arbitrary depth. This article explains why: Can regular expressions be used to match nested patterns?
And to prove my point, allow me to share an actual solution with a regex that I think will match what you want:
^  # Start of string
[^()'""]+\(  # matches `func(`
  #
(?>  # START - Iterator (match each parameter)
  (?(param)\s*,(?>\s*))  # if it's not the 1st parameter, start with a `,`
  (?'param'  # opens 'param' (main group, captures each parameter)
    #
    (?>  # Group: matches every char in parameter
      (?'qt'['""])  # ALTERNATIVE 1: strings (matches ""foo"",'ba\'r','g)o\'o')
      (?:  # match anything inside quotes
        [^\\'""]+  # any char except quotes or escapes
        |(?!\k'qt')['""]  # or the quotes not used here (ie ""double'quotes"")
        |\\.  # or any escaped char
      )*  # repeat: *
      \k'qt'  # close quotes
    | (?'parens'\()  # ALTERNATIVE 2: `(` open nested parens (nested func)
    | (?'-parens'\))  # ALTERNATIVE 3: `)` close nested parens
    | (?'braces'\{)  # ALTERNATIVE 4: `{` open braces
    | (?'-braces'})  # ALTERNATIVE 5: `}` close braces
    | [^,(){}\\'""]  # ALTERNATIVE 6: anything else (var, funcName, operator, etc)
    | (?(parens),)  # ALTERNATIVE 7: `,` a comma if inside parens
    | (?(braces),)  # ALTERNATIVE 8: `,` a comma if inside braces
    )*  # Repeat: *
    # CONDITIONS:
    (?(parens)(?!))  # a. balanced parens
    (?(braces)(?!))  # b. balanced braces
    (?<!\s)  # c. no trailing spaces
    #
  )  # closes 'param'
)*  # Repeat the whole thing once for every parameter
  #
\s*\)\s*(?:;\s*)?  # matches `)` at the end of func(), maybe with a `;`
$  # END
One-liner:
^[^()'""]+\((?>(?(param)\s*,(?>\s*))(?'param'(?>(?'qt'['""])(?:[^\\'""]+|(?!\k'qt')['""]|\\.)*\k'qt'|(?'parens'\()|(?'-parens'\))|(?'braces'\{)|(?'-braces'})|[^,(){}\\'""]|(?(parens),)|(?(braces),))*(?(parens)(?!))(?(braces)(?!))(?<!\s)))*\s*\)\s*(?:;\s*)?$
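To read the parameters out of a match in C#, keep in mind that the 'param' group captures once per parameter, so the values live in Groups["param"].Captures rather than in a single Value. A minimal sketch, assuming the input has already been cut down to the bare call (the pattern is anchored with ^ and $); the class and method names here are hypothetical, only the pattern comes from this answer:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FuncCallRegex
{
    // The one-liner from above, pasted into a verbatim string
    // (which is why every double quote appears doubled).
    private const string Pattern =
        @"^[^()'""]+\((?>(?(param)\s*,(?>\s*))(?'param'(?>(?'qt'['""])(?:[^\\'""]+|(?!\k'qt')['""]|\\.)*\k'qt'|(?'parens'\()|(?'-parens'\))|(?'braces'\{)|(?'-braces'})|[^,(){}\\'""]|(?(parens),)|(?(braces),))*(?(parens)(?!))(?(braces)(?!))(?<!\s)))*\s*\)\s*(?:;\s*)?$";

    // Hypothetical helper: returns every capture of the 'param' group, in order.
    // Returns an empty list when the input does not match the pattern at all.
    public static List<string> GetParameters(string call)
    {
        var result = new List<string>();
        Match match = Regex.Match(call, Pattern);
        if (!match.Success)
        {
            return result;
        }

        foreach (Capture capture in match.Groups["param"].Captures)
        {
            result.Add(capture.Value);
        }
        return result;
    }
}
```

Each capture is the raw parameter text, so string literals come back with their surrounding quotes and escapes intact; unquoting them is a separate step.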
As you can imagine by now (if you're still reading), even with an indented pattern and a comment for every construct, this regex is unreadable, quite difficult to maintain and almost impossible to debug. And I expect there are edge cases that would make it fail.
Just in case a stubborn mind is still interested, here's a link to the logic behind it: Matching Nested Constructs with Balancing Groups (regular-expressions.info)
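For contrast, the non-regex route can be sketched as a small hand-written scanner that tracks quote state and nesting depth, and splits on commas only at the top level. This is an illustration under stated assumptions, not a JavaScript parser: the class and method names are hypothetical, it treats (), {} and [] as interchangeable nesting brackets, and it ignores comments and regex literals.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CallArgumentScanner
{
    // Hypothetical helper: splits the argument list of a call such as
    // func('a,b', g(1,2)) into its top-level arguments.
    public static List<string> SplitArguments(string call)
    {
        var args = new List<string>();
        int open = call.IndexOf('(');
        if (open < 0) throw new FormatException("No opening parenthesis found.");

        var current = new StringBuilder();
        char quote = '\0';  // active quote character, or '\0' outside a string
        int depth = 0;      // nesting depth of brackets inside the argument list

        for (int i = open + 1; i < call.Length; i++)
        {
            char c = call[i];

            if (quote != '\0')
            {
                current.Append(c);
                if (c == '\\' && i + 1 < call.Length)
                    current.Append(call[++i]);  // keep the escaped character verbatim
                else if (c == quote)
                    quote = '\0';               // closing quote
                continue;
            }

            switch (c)
            {
                case '\'':
                case '"':
                    quote = c;
                    current.Append(c);
                    break;
                case '(':
                case '{':
                case '[':
                    depth++;
                    current.Append(c);
                    break;
                case ')':
                case '}':
                case ']':
                    if (depth == 0)
                    {
                        if (c != ')') throw new FormatException("Unbalanced bracket.");
                        // This ')' closes the call itself: flush the last argument.
                        string last = current.ToString().Trim();
                        if (last.Length > 0 || args.Count > 0) args.Add(last);
                        return args;
                    }
                    depth--;
                    current.Append(c);
                    break;
                case ',':
                    if (depth == 0)
                    {
                        // Top-level comma: the current argument is complete.
                        args.Add(current.ToString().Trim());
                        current.Clear();
                    }
                    else
                    {
                        current.Append(c);
                    }
                    break;
                default:
                    current.Append(c);
                    break;
            }
        }
        throw new FormatException("Unterminated call.");
    }
}
```

The loop is longer than the one-liner, but each rule (quotes, escapes, nesting, separators) sits in its own branch, so it can be stepped through in a debugger and extended one case at a time.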
Upvotes: 5