Reputation: 662
I am trying to match all the strings that are enclosed by parentheses. For example, for the string below:
((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")
I want to match ALL of the following:
I have never really gotten the hang of regular expressions, so I am having trouble. What I have right now is:
pattern = '\((.*?)\)'
m = re.match(pattern, string)
print m.group()
'((5.85B8.5V + ?; 1.79")'
That is kind of close, but it pairs the first opening parenthesis with the first closing parenthesis instead of pairing each opening parenthesis with its own closing one. Any ideas?
Thanks!
Upvotes: 2
Views: 432
Reputation: 215059
Regular expressions are bad at parsing nested structures. However, some dialects provide a recursion operator, (?R) or (?n), which can help here. Python's stock re module doesn't support it, but fortunately the third-party regex module does:
>>> import regex
>>> s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
>>> regex.findall(r'(?=(\((?:[^()]|(?1))*\)))', s)
['((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")', '(5.85B8.5V + ?; 1.79")', '(6.78A0 + ?; .97")']
That said, regexes are not your best option for parsing generic context-free languages (which your string apparently belongs to). Consider using a real parser instead, which you can build with pyparsing or a similar package, or just code one by hand, which is rather trivial here:
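def expressions(s):
    stack = []  # start positions of the groups that are still open
    for n, c in enumerate(s):
        if c == '(':
            stack.append(n + 1)
        elif c == ')':
            # yield the text between the matching parentheses
            yield s[stack.pop():n]

for x in expressions(s):
    print(x)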
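As a variation on the hand-written scanner above, here is a minimal sketch that keeps the enclosing parentheses in each result and reports unbalanced input instead of failing on an empty stack. The name paren_groups and the choice to raise ValueError are assumptions made for illustration, not part of the original answer.

```python
def paren_groups(s):
    """Yield every balanced parenthesized substring of s, innermost first."""
    stack = []  # indices of '(' characters that are still open
    for i, c in enumerate(s):
        if c == '(':
            stack.append(i)
        elif c == ')':
            if not stack:
                raise ValueError("unmatched ')' at index %d" % i)
            start = stack.pop()
            yield s[start:i + 1]  # slice includes both parentheses
    if stack:
        raise ValueError("unmatched '(' at index %d" % stack[-1])


if __name__ == "__main__":
    s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
    for group in paren_groups(s):
        print(group)
```

Because a group is yielded when its closing parenthesis is reached, the inner groups come out before the group that contains them, and this works for any nesting depth.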
Upvotes: 3
Reputation: 174874
Use lookarounds in order to do an overlapping match.
>>> import re
>>> s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
>>> re.findall(r'(?=\(([^()]*|.*)\))', s)
['(5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760"', '5.85B8.5V + ?; 1.79"', '6.78A0 + ?; .97"']
With re.findall, overlapping matches are only possible through lookarounds, because an ordinary match consumes the characters it matches. So put your pattern inside a positive lookahead assertion.
\(([^()]*|.*)\)
The \( at the start matches a literal ( symbol. The outer () is a capturing group. Inside it, [^()]* matches any character other than ( or ) zero or more times; if that fails, the alternative .* greedily matches any character zero or more times, up to the last \) symbol.
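To make the effect of the lookahead concrete, the sketch below runs the same pattern with and without it on the string from the question. The variable names consuming and overlapping are only illustrative.

```python
import re

s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'

# Without a lookahead, the first match consumes the whole string,
# so the groups that start later are never reported.
consuming = re.findall(r'\(([^()]*|.*)\)', s)

# Inside a lookahead nothing is consumed, so the engine retries at
# every position and also finds the groups that start later.
overlapping = re.findall(r'(?=\(([^()]*|.*)\))', s)

print(len(consuming))    # 1
print(len(overlapping))  # 3
```

Note that the .* branch always runs to the last ) in the string, so this pattern only suits one level of nesting. With deeper nesting such as '(a(b(c)))', the middle group comes back unbalanced as 'b(c))'.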
Upvotes: 1
Reputation: 107357
You can use re.findall twice: once with r'\(([^()]*)\)' to get the inner groups, and once with r'\((.*)\)' to match the whole string:
>>> import pprint
>>> import re
>>> s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
>>> l = re.findall(r'\(([^()]*)\)', s) + re.findall(r'\((.*)\)', s)
>>> pprint.pprint(l)
['5.85B8.5V + ?; 1.79"',
'6.78A0 + ?; .97"',
'(5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760"']
Upvotes: 1