Python account

Reputation: 437

Find string between reoccurring substrings?

I have a string similar to

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

I'm trying to extract the text inside each outermost pair of parentheses:

['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8']

I have tried

result = re.search('%s(.*)%s' % ("(", ")"), s).group(1)
result =(s[s.find("(")+1 : s.find(")")])
result = re.search('((.*))', s)

Upvotes: 2

Views: 91

Answers (2)

Andrew Allen

Reputation: 8002

If you did want to make a rough parser for this, it would look something like this.

This uses the scanner method of compiled pattern objects (an undocumented feature of the re module). It iterates through the tokens and tracks a nesting level that rises on each left bracket and falls on each right bracket; a finished element is appended to the list whenever the level is back at 0.

import re

# Token specification
TEST = r'(?P<TEST>test[0-9]*)'
LEFT_BRACKET = r'(?P<LEFT_BRACKET>\()'
RIGHT_BRACKET = r'(?P<RIGHT_BRACKET>\))'
AND = r'(?P<AND> and )'
OR = r'(?P<OR> or )'

master_pat = re.compile('|'.join([TEST, LEFT_BRACKET, RIGHT_BRACKET, AND, OR]))

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

def generate_list(pat, text):
    ans = []
    elem = ''
    level = 0
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        # print(m.lastgroup, m.group(), level)
        # keep building elem when nested (level > 1), when inside a top-level group
        # (level 1, except its closing bracket), or at level 0 unless the token is '(' or ' and '
        if (level > 1 or
          (level == 1 and m.lastgroup != 'RIGHT_BRACKET') or
          (level == 0 and m.lastgroup not in ['LEFT_BRACKET', 'AND'])
        ):
            elem += m.group()
        # if at level 0 we can append
        if level == 0 and elem != '':
            ans.append(elem)
            elem = ''
        # set level
        if m.lastgroup == 'LEFT_BRACKET':
            level += 1
        elif m.lastgroup == 'RIGHT_BRACKET':
            level -= 1
    # flush a trailing group, e.g. when the text ends with ')'
    if elem:
        ans.append(elem)
    return ans


generate_list(master_pat, s)
# ['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8', 'test9']

To see how scanner behaves:

master_pat = re.compile('|'.join([TEST, LEFT_BRACKET, RIGHT_BRACKET, AND, OR]))
s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

scanner = master_pat.scanner(s)
scanner.match()
# <re.Match object; span=(0, 1), match='('>
_.lastgroup, _.group()
# ('LEFT_BRACKET', '(')
scanner.match()
# <re.Match object; span=(1, 6), match='test1'>
_.lastgroup, _.group()
# ('TEST', 'test1')
scanner.match()
# <re.Match object; span=(6, 10), match=' or '>
_.lastgroup, _.group()
# ('OR', ' or ')
scanner.match()
# <re.Match object; span=(10, 11), match='('>
_.lastgroup, _.group()
# ('LEFT_BRACKET', '(')
scanner.match()
# <re.Match object; span=(11, 16), match='test2'>
_.lastgroup, _.group()
# ('TEST', 'test2')
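For a view of the whole token stream at once, a minimal sketch along the same lines is shown below. It swaps the undocumented scanner for the documented finditer method, and the helper name token_levels is hypothetical, made up for this illustration. One difference to keep in mind: finditer silently skips characters that no token matches, whereas iterating scanner.match stops at the first position where nothing matches.

```python
import re

# Same token patterns as in the answer above.
TOKENS = [
    r'(?P<TEST>test[0-9]*)',
    r'(?P<LEFT_BRACKET>\()',
    r'(?P<RIGHT_BRACKET>\))',
    r'(?P<AND> and )',
    r'(?P<OR> or )',
]
master_pat = re.compile('|'.join(TOKENS))

def token_levels(pat, text):
    """Hypothetical helper: return (token_name, token_text, level_before_token) triples."""
    out = []
    level = 0
    for m in pat.finditer(text):
        out.append((m.lastgroup, m.group(), level))
        # update the nesting level after recording the token
        if m.lastgroup == 'LEFT_BRACKET':
            level += 1
        elif m.lastgroup == 'RIGHT_BRACKET':
            level -= 1
    return out

s = "(test1 or (test2 or test3)) and (test4 and (test6))"
for name, tok, level in token_levels(master_pat, s):
    print(level, name, repr(tok))
```

Each printed line shows the level in effect when the token was read, which is the value generate_list tests before deciding whether to keep the token.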

Upvotes: 0

Jean-François Fabre

Reputation: 140188

You have nested parentheses. That calls for a parser or, if you don't want to go that far, going back to basics: scan character by character and track the nesting level to find where each top-level group closes.

Then a small hack removes whatever precedes each group (the and tokens), if anything.

Here is the code I've written for this. It's not short, but not very complex either, and it's self-contained, with no extra libraries:

s = "(test1 or (test2 or test3)) and (test4 and (test6)) and (test7 or test8) and test9"

nesting_level = 0
previous_group_index = 0

def rework_group(group):
    # Not the brightest function, but it works; it may need tuning.
    # This isn't the core of the algorithm, just simple string operations.
    # Look for the first opening parenthesis and remove what comes before it.
    idx = group.find("(")
    if idx!=-1:
        group = group[idx:]
    else:
        # no parentheses: split according to blanks, keep last item
        group = group.split()[-1]
    return group

result = []

for i,c in enumerate(s):
    if c=='(':
        nesting_level += 1
    elif c==')':
        nesting_level -= 1
        if nesting_level == 0:
            result.append(rework_group(s[previous_group_index:i+1]))
            previous_group_index = i+1

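# handle whatever follows the last top-level group (skip it if nothing is left,
# otherwise split()[-1] would raise IndexError on an empty string)
tail = s[previous_group_index:]
if tail.strip():
    result.append(rework_group(tail))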

Result:

>>> result
['(test1 or (test2 or test3))',
 '(test4 and (test6))',
 '(test7 or test8)',
 'test9']
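The groups above keep their enclosing parentheses, while the question asks for the text between them. One possible follow-up step, sketched here under the assumption that each group has balanced parentheses, removes a single outer pair only when the first "(" really is closed by the last ")". The helper name strip_outer_parens is hypothetical and not part of the code above.

```python
def strip_outer_parens(group):
    """Drop one pair of enclosing parentheses, if they match each other.

    Assumes the parentheses in `group` are balanced.
    """
    if not (group.startswith("(") and group.endswith(")")):
        return group
    depth = 0
    for i, c in enumerate(group):
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            # The first "(" closes before the end, so the outer pair does not match.
            if depth == 0 and i != len(group) - 1:
                return group
    return group[1:-1]

groups = ['(test1 or (test2 or test3))', '(test4 and (test6))', '(test7 or test8)', 'test9']
stripped = [strip_outer_parens(g) for g in groups]
print(stripped)
# ['test1 or (test2 or test3)', 'test4 and (test6)', 'test7 or test8', 'test9']
```

The depth check is what keeps a string such as "(a) and (b)" intact: it starts and ends with parentheses, but they belong to different groups.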

Upvotes: 2
