user10895916

Regular expression to return string split up respecting nested parentheses

I know there are many answers on how to split a string while respecting parentheses, but none of them do so recursively. Take the string 1 2 3 (test 0, test 0) (test (0 test) 0):
The regex \s(?![^\(]*\)) returns "1", "2", "3", "(test 0, test 0)", "(test", "(0 test) 0)"
The regex I'm looking for would return either
"1", "2", "3", "(test 0, test 0)", "(test (0 test) 0)"
or
"1", "2", "3", "test 0, test 0", "test (0 test) 0"
which would let me recursively use it on the results again until no parentheses remain.
Ideally it would also respect escaped parentheses, but I only know the basics of regex and this is beyond me.
Does anyone have an idea how to approach this?
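
For reference, the behaviour described above can be reproduced with the standard re module. This is only a minimal sketch of the failing attempt (the variable names are illustrative). The lookahead merely checks whether a ) can be reached without crossing a (, so it cannot keep track of nesting depth:

```python
import re

text = "1 2 3 (test 0, test 0) (test (0 test) 0)"

# Split on whitespace that is NOT followed by ")" before the next "(".
# This protects only the innermost level of parentheses.
parts = re.split(r"\s(?![^\(]*\))", text)
print(parts)
# ['1', '2', '3', '(test 0, test 0)', '(test', '(0 test) 0)']
```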

Upvotes: 2

Views: 1083

Answers (3)

quasi-human

Reputation: 1928

Alternatively, you can use pyparsing.

import pyparsing as pp

pattern = pp.ZeroOrMore(pp.Regex(r'\S+') ^ pp.original_text_for(pp.nested_expr('(', ')')))

# Tests
string = '1 2 3 (test 0, test 0) (test (0 test) 0)'
result = pattern.parse_string(string).as_list()
answer = ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
assert result == answer

string = ''
result = pattern.parse_string(string).as_list()
answer = []
assert result == answer

string = 'a'
result = pattern.parse_string(string).as_list()
answer = ['a']
assert result == answer

string = ' a (1) ! '
result = pattern.parse_string(string).as_list()
answer = ['a', '(1)', '!']
assert result == answer

string = ' a (b) cd (e f) g hi (j (k l) m) (o p (qr (s t) u v) w (x y) z)'
result = pattern.parse_string(string).as_list()
answer = ['a', '(b)', 'cd', '(e f)', 'g', 'hi', '(j (k l) m)', '(o p (qr (s t) u v) w (x y) z)']
assert result == answer

* pyparsing can be installed with pip install pyparsing

In addition, you can directly parse all the nested parentheses at once:

pattern = pp.ZeroOrMore(pp.Regex(r'\S+') ^ pp.nested_expr('(', ')'))

string = '1 2 3 (test 0, test 0) (test (0 test) 0)'
result = pattern.parse_string(string).as_list()
answer = ['1', '2', '3', ['test', '0,', 'test', '0'], ['test', ['0', 'test'], '0']]
assert result == answer

* Whitespace is a delimiter in this case.

Note:

If the parentheses in the input are unbalanced (for example a(b(c), a(b)c), etc.), you get an unexpected result or an IndexError is raised, so use this with care. (See: Python extract string in a phrase)
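
One way to guard against that is to check the input before parsing. The following is a minimal sketch using only plain Python; the helper name is_balanced is hypothetical and not part of pyparsing, and it ignores escaped parentheses:

```python
def is_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared before its matching '('.
                return False
    # Any remaining depth means an unclosed '('.
    return depth == 0


print(is_balanced("1 2 3 (test 0, test 0) (test (0 test) 0)"))  # True
print(is_balanced("a(b(c)"))  # False
print(is_balanced("a(b)c)"))  # False
```

Only strings that pass this check would then be handed to pattern.parse_string.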

Upvotes: 1

niko

Reputation: 5281

Using only a regex for this task might work, but it wouldn't be straightforward.

Another possibility is writing a simple algorithm to track the parentheses in the string:

  1. Split the string at all parentheses, while returning the delimiter (e.g. using re.split)
  2. Keep two counters tracking the parentheses: start_parens_count for ( and end_parens_count for ).
  3. Using the counters, proceed by either splitting at white spaces or adding the current data to a temp var (term)
  4. When the left most parenthesis has been closed, append term to the list of values & reset the counters/temp vars.

Here's an example:

import re

string = "1 2 3 (test 0, test 0) (test (0 test) 0)"


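# result collects the tokens; term accumulates the text of the group being built
result, start_parens_count, end_parens_count, term = [], 0, 0, ""
for x in re.split(r"([()])", string):
    if x == "(":
        if start_parens_count > 0:
            term += "("  # nested opener: keep it as part of the term
        start_parens_count += 1
    elif x == ")":
        end_parens_count += 1
        if end_parens_count == start_parens_count:
            # the outermost parenthesis has been closed: store the term and reset
            result.append(term)
            end_parens_count, start_parens_count, term = 0, 0, ""
        else:
            term += ")"  # nested closer: keep it as part of the term
    elif start_parens_count > end_parens_count:
        term += x  # inside a group: keep the text verbatim, whitespace included
    else:
        result.extend(x.split())  # outside any group: split on whitespace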


print(result)
# ['1', '2', '3', 'test 0, test 0', 'test (0 test) 0']

Not very elegant, but works.
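
The question also mentions escaped parentheses. As a rough sketch only, the same counting idea can be written as a single pass over the characters with one depth counter. This is a different variant from the code above: it keeps the outer parentheses in the result, and it assumes that a backslash escapes the following character. The name split_top_level is hypothetical.

```python
def split_top_level(text):
    """Split on whitespace outside parentheses; a backslash escapes the next char."""
    tokens, current, depth = [], [], 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Escaped character: copy both chars and never count it as a paren.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch.isspace() and depth == 0:
            # Top-level whitespace ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if current:
        tokens.append("".join(current))
    return tokens


print(split_top_level("1 2 3 (test 0, test 0) (test (0 test) 0)"))
# ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
print(split_top_level(r"a (b \) c) d"))
# ['a', '(b \\) c)', 'd']
```

A stray ) at depth 0 is treated as an ordinary character here; whether that is the right choice depends on how malformed input should be handled.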

Upvotes: 2

Wiktor Stribiżew

Reputation: 626896

You can install the third-party regex module (pip install regex), which supports recursion, and use

import regex
text = "1 2 3 (test 0, test 0) (test (0 test) 0)"
matches = [match.group() for match in regex.finditer(r"(?:(\((?>[^()]+|(?1))*\))|\S)+", text)]
print(matches)
# => ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']

See the online Python demo. See the regex demo. The regex matches:

  • (?: - start of a non-capturing group:
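    • (\((?>[^()]+|(?1))*\)) - capturing group 1: a ( char, then zero or more repetitions of an atomic group matching either one or more chars other than parentheses or the whole group 1 pattern recursed via (?1), then a ) char, i.e. a substring between balanced, possibly nested parentheses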
  • | - or
    • \S - any non-whitespace char
  • )+ - end of the group, repeat one or more times
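
To illustrate what the recursion does, here is a rough hand-written mirror of the pattern in plain Python. It is not how the regex module works internally, only a sketch of the same grammar; the function names match_group and tokens are hypothetical.

```python
def match_group(text, pos):
    """Return the index just past a balanced '(...)' starting at pos, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1                # closing paren of this group
        if ch == "(":
            end = match_group(text, pos)  # plays the role of (?1)
            if end == -1:
                return -1
            pos = end
        else:
            pos += 1                      # plays the role of [^()]+
    return -1                             # ran out of text: unbalanced


def tokens(text):
    """Collect maximal runs of balanced groups and non-whitespace chars."""
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        start = pos
        while pos < len(text) and not text[pos].isspace():
            end = match_group(text, pos)
            # Take a whole balanced group if possible, else a single char.
            pos = end if end != -1 else pos + 1
        out.append(text[start:pos])
    return out


print(tokens("1 2 3 (test 0, test 0) (test (0 test) 0)"))
# ['1', '2', '3', '(test 0, test 0)', '(test (0 test) 0)']
```

The recursive call in match_group corresponds to (?1): the group pattern re-enters itself whenever it meets another opening parenthesis.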

Upvotes: 1
