Reputation: 1108
I need to parse expressions of the following form:
(S (A (B (D xyz)) (C m)))
The amount of ( will always equal the amount of ), but there may be any arbitrary number of opening and closing parenthesis pairs between the (S ). In this case, I would want to extract the (A (B (D xyz)) (C m)). There may be any number of (S ) clauses in a file, so I can't simply do a ^(S .* )$ kind of pattern matching.
If I knew the amount of potential opening and closing parenthesis pairs between the (S ) this wouldn't be as difficult, but I'm not sure how to write a regular expression that will know to match an arbitrary amount of ().
Any help in obtaining the regexp pattern would be greatly appreciated. Thanks in advance.
Upvotes: 3
Views: 1005
Reputation:
Probably a record descent parse would be the best bet. But if you just want to find (S) balanced, it could be done with a regex that does recursion in the engine.
It will find the outter most balance. If you were looking to nesting like (S(S)) that might involve recursively calling a function that implements the regex, passing the 'core' of a sucessfull match. And possibly creating a parent-child structure in the process. But if its this involved, a real parser might be the solution.
How it can be solved with a Perl regex -
$str = '(some (stuff (S (A (B (D xyz)) (C m))) the end ) (S extra))';
$regex = qr~
[(]
\s* S \s*
( # 1
( # 2
[(]
(?: (?> [^()]+ )
| (?2)
)*
[)]
)
|
[^)]*
)
[)]
~x;
while ($str =~ /$regex/g)
{
print "found '$1'\n";
}
prints
found '(A (B (D xyz)) (C m))'
found 'extra'
Upvotes: 0
Reputation: 726609
This cannot be done in theory, and can be done in practice only when maximal number of nested parentheses is known upfront. That solution requires a rather non-pleasant expression, and is usually attempted as a curious homework exercise. Here is a link with better explanation of why regexp language is not powerful enough to solve the matching parentheses problem.
You need a parser to solve this problem; a simple recursive descent one will do the trick. Wikipedia article at the link above has a sample implementation in C, which you should be able to translate to other languages with relative ease.
Upvotes: 1
Reputation: 284836
Matching an arbitrary number is impossible with pure regular expressions. In other words, you can not match a count that it is unknown when you're generating/writing the regular expression is impossible. Matching n
pairs (however high n
is) is possible as long as you know n
when generating the regex.
Upvotes: 1