kgully

Reputation: 662

Regular Expression to match string inside parentheses

I am trying to match all the strings that are enclosed by parentheses. For example, for the string below:

((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")

I want to match ALL of the following:

I have never really gotten the hang of regular expressions, so I am having trouble. What I have right now is:

import re
pattern = r'\((.*?)\)'
m = re.match(pattern, string)
print(m.group())
'((5.85B8.5V + ?; 1.79")'

That is kind of close, but it pairs the first opening parenthesis with the first closing parenthesis instead of matching each opening parenthesis with its own closing one. Any ideas?

Thanks!

Upvotes: 2

Views: 432

Answers (3)

georg

Reputation: 215059

Regular expressions are bad at parsing nested structures. However, some dialects provide the recursive operators (?R) or (?n), which can help with that. Python's stock re module doesn't support them, but fortunately there is the regex module, which does:

>>> import regex
>>> s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
>>> regex.findall(r'(?=(\((?:[^()]|(?1))*\)))', s)
['((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")', '(5.85B8.5V + ?; 1.79")', '(6.78A0 + ?; .97")']

That said, regexes are not your best option for parsing generic context-free languages (which your string apparently belongs to). Consider using a real parser instead, which you can build with pyparsing or a similar package, or just code one by hand. That's rather trivial here:

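def expressions(s):
    stack = []  # start positions (just past each '(' that is still open)
    for n, c in enumerate(s):
        if c == '(':
            stack.append(n+1)
        elif c == ')':
            # the innermost open group ends here; yield its contents
            # without the surrounding parentheses
            yield s[stack.pop():n]

for x in expressions(s):
    print(x)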
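
If you want the parentheses included in each result (as in the regex output above) and an explicit error on unbalanced input, the same stack idea can be adapted. This is only a sketch: paren_spans is a hypothetical name, and raising ValueError is my own choice, not something the question asks for.

```python
def paren_spans(text):
    """Yield every balanced '(...)' substring of text, parentheses included."""
    stack = []  # indices of '(' characters that are still open
    for i, ch in enumerate(text):
        if ch == '(':
            stack.append(i)
        elif ch == ')':
            if not stack:
                # assumption: treat a stray ')' as an error rather than skip it
                raise ValueError("unmatched ')' at index %d" % i)
            yield text[stack.pop():i + 1]
    if stack:
        raise ValueError("unmatched '(' at index %d" % stack[-1])


s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
for span in paren_spans(s):
    print(span)
```

Inner groups come out before the groups that contain them, because a group is only yielded once its closing parenthesis is reached.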

Upvotes: 3

Avinash Raj

Reputation: 174874

Use lookarounds in order to do an overlapping match.

>>> import re
>>> s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
>>> re.findall(r'(?=\(([^()]*|.*)\))', s)
['(5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760"', '5.85B8.5V + ?; 1.79"', '6.78A0 + ?; .97"']


With re.findall, overlapping matches are only possible through lookarounds, because an ordinary match consumes the text it covers and the search resumes after it. So put your pattern inside a positive lookahead assertion.

Breaking down \(([^()]*|.*)\):

\( at the start matches a literal ( symbol.
(...) is a capturing group.
[^()]* matches any character except ( or ), zero or more times.
| means "or".
.* matches any character zero or more times, greedily, up to the last \) symbol.
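
To see what the lookahead buys you, here is a small comparison of the same pattern with and without it. Treat it as a sketch: the second input t is my own made-up example, not from the question, and it shows that the .* alternative relies on the outer group spanning the whole string.

```python
import re

s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'

# Without the lookahead, the first match consumes the whole string,
# so the inner groups are never reported.
consuming = re.findall(r'\(([^()]*|.*)\)', s)

# Inside a lookahead nothing is consumed, so the scan retries at every position.
overlapping = re.findall(r'(?=\(([^()]*|.*)\))', s)

print(len(consuming))    # 1
print(len(overlapping))  # 3

# Assumed extra input (not from the question): two top-level groups.
# The greedy .* runs to the last ')', so the first result is not balanced.
t = '(a(b)) (c)'
print(re.findall(r'(?=\(([^()]*|.*)\))', t))  # ['a(b)) (c', 'b', 'c']
```

So the pattern works for the string in the question, but for arbitrary nesting a recursive pattern or a small hand-written parser is safer.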

Upvotes: 1

Kasravnd

Reputation: 107357

You can use re.findall twice: once with r'\(([^()]*)\)' to get the innermost groups, and once with r'\((.*)\)' to match the whole string:

>>> import re
>>> import pprint
>>> s = '((5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760")'
>>> l = re.findall(r'\(([^()]*)\)', s) + re.findall(r'\((.*)\)', s)
>>> pprint.pprint(l)
['5.85B8.5V + ?; 1.79"',
 '6.78A0 + ?; .97"',
 '(5.85B8.5V + ?; 1.79") + (6.78A0 + ?; .97"); 4.760"']

Upvotes: 1
