Reputation: 25997
When I have a string like this:
s1 = 'stuff(remove_me)'
I can easily remove the parentheses and the text within using
# returns 'stuff'
res1 = re.sub(r'\([^)]*\)', '', s1)
as explained here.
But I sometimes encounter nested expressions like this:
s2 = 'stuff(remove(me))'
When I run the command from above, I end up with
'stuff)'
I also tried:
re.sub(r'\(.*?\)', '', s2)
which gives me the same output.
How can I remove everything within the outer parentheses - including the parentheses themselves - so that I also end up with 'stuff'
(which should work for arbitrarily complex expressions)?
Upvotes: 9
Views: 14730
Reputation: 2428
Just compute the difference between the cumulative counts of '(' and ')':
import numpy as np
s = '()a(x(x)x)b(x)c()d()'
s_array = np.array(list(s))
mask_open = s_array=='('
mask_close = s_array==')'
# Compute in how many parentheses each character is nested,
# while considering ')' as not nested:
nestedness_except_close = np.cumsum(mask_open) - np.cumsum(mask_close)
# ... and while considering ')' as nested:
nestedness = nestedness_except_close + mask_close
# Select only characters that aren't in any parentheses
result = ''.join(s_array[nestedness < 1])
This might be faster than other solutions.
Optional validity checks for the string:
# Check whether the number of `'('`s and `')'`s is the same
assert(nestedness_except_close[-1] == 0)
# Check whether some parentheses get closed before they got opened
assert((nestedness_except_close >= 0).all())
If you don't want to use NumPy, you can use itertools.accumulate() to compute the cumulative sums.
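A minimal sketch of that NumPy-free variant, assuming the same nesting-depth idea as above. The function name `remove_parenthesized` is made up for illustration, and the input is assumed to have balanced parentheses:

```python
from itertools import accumulate

def remove_parenthesized(s):
    """Drop every character inside parentheses, plus the parentheses themselves."""
    opens = accumulate(ch == '(' for ch in s)   # running count of '('
    closes = accumulate(ch == ')' for ch in s)  # running count of ')'
    kept = []
    for ch, n_open, n_close in zip(s, opens, closes):
        # Nesting depth, counting a ')' as still nested (hence the +1 for ')').
        depth = n_open - n_close + (ch == ')')
        if depth < 1:
            kept.append(ch)
    return ''.join(kept)

print(remove_parenthesized('()a(x(x)x)b(x)c()d()'))  # abcd
print(remove_parenthesized('stuff(remove(me))'))     # stuff
```

This gives up NumPy's vectorization, so it is probably not the faster option for long strings.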
Upvotes: 0
Reputation: 389
I have found a solution here:
http://rachbelaid.com/recursive-regular-experession/
which says:
>>> import regex
>>> regex.search(r"^(\((?1)*\))(?1)*$", "()()") is not None
True
>>> regex.search(r"^(\((?1)*\))(?1)*$", "(((()))())") is not None
True
>>> regex.search(r"^(\((?1)*\))(?1)*$", "()(") is not None
False
>>> regex.search(r"^(\((?1)*\))(?1)*$", "(((())())") is not None
False
Upvotes: 0
Reputation: 626748
NOTE: \(.*\) matches the first ( from the left, then matches any 0+ characters (other than a newline if a DOTALL modifier is not enabled) up to the last ), and does not account for properly nested parentheses.
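A small illustration of that limitation, using a made-up input with two separate top-level groups. The greedy pattern swallows the text between the groups, while the lazy variant leaves a stray parenthesis behind:

```python
import re

s = 'keep(a)keep(b(c))keep'  # hypothetical test string

# Greedy: spans from the first '(' to the last ')', so the middle 'keep' is lost.
greedy = re.sub(r'\(.*\)', '', s)

# Lazy: each match stops at the nearest ')', so the nested group leaves a ')'.
lazy = re.sub(r'\(.*?\)', '', s)

print(greedy)  # keepkeep
print(lazy)    # keepkeep)keep
```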
To remove nested parentheses correctly with a regular expression in Python, you may use a simple \([^()]*\) (matching a (, then 0+ chars other than ( and ) and then a )) in a while block using re.subn:
import re

def remove_text_between_parens(text):
    n = 1  # run at least once
    while n:
        text, n = re.subn(r'\([^()]*\)', '', text)  # remove non-nested/flat balanced parts
    return text
Basically: remove the (...) with no ( and ) inside until no match is found. Usage:
print(remove_text_between_parens('stuff (inside (nested) brackets) (and (some(are)) here) here'))
# => stuff   here
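To see how the repeated substitution peels the nesting from the inside out, here is a sketch that records the string after each pass. The helper name `trace_passes` is hypothetical and only serves the illustration:

```python
import re

def trace_passes(text):
    """Return the input followed by the string after each substituting pass."""
    steps = [text]
    while True:
        # Each pass removes only the innermost (flat) parenthesized groups.
        text, n = re.subn(r'\([^()]*\)', '', text)
        if n == 0:  # nothing left to remove
            break
        steps.append(text)
    return steps

for step in trace_passes('stuff(remove(me))'):
    print(repr(step))
# 'stuff(remove(me))'
# 'stuff(remove)'
# 'stuff'
```

The number of substituting passes equals the deepest nesting level in the input.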
A non-regex way is also possible:
def removeNestedParentheses(s):
    ret = ''
    skip = 0  # current nesting depth
    for i in s:
        if i == '(':
            skip += 1
        elif i == ')' and skip > 0:
            skip -= 1
        elif skip == 0:
            ret += i
    return ret
x = removeNestedParentheses('stuff (inside (nested) brackets) (and (some(are)) here) here')
print(x)
# => stuff   here
Upvotes: 20
Reputation: 18490
As mentioned before, you'd need a recursive regex to match arbitrary levels of nesting. If you know there can be at most one level of nesting, try this pattern:
\((?:[^)(]|\([^)(]*\))*\)
[^)(] matches a character that is not a parenthesis (negated class).
|\([^)(]*\) or it matches another ( ) pair with any amount of non )( characters inside.
(?:...)* repeats all this any amount of times inside the outer ( ).
Before the alternation, [^)(] is used without a + quantifier so that the match fails faster if the parentheses are unbalanced.
You need to extend the pattern by one level for each level of nesting that might occur. E.g., for a maximum of 2 levels:
\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\)
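Since each extra level just wraps the previous pattern once more, the pattern can also be generated instead of written by hand. A minimal sketch, where the helper name `nested_paren_pattern` is an assumption for illustration:

```python
import re

def nested_paren_pattern(max_depth):
    """Build a regex for a (...) group containing up to max_depth inner levels."""
    pattern = r'\([^)(]*\)'  # depth 0: a flat group with no parentheses inside
    for _ in range(max_depth):
        # Wrap the previous level: non-parenthesis characters or a nested group.
        pattern = r'\((?:[^)(]|' + pattern + r')*\)'
    return pattern

print(nested_paren_pattern(1))  # \((?:[^)(]|\([^)(]*\))*\)
print(re.sub(nested_paren_pattern(1), '', 'stuff(remove(me))'))  # stuff
print(re.sub(nested_paren_pattern(2), '', 'a(b(c(d)))e'))        # ae
```

If the input nests deeper than `max_depth`, only the inner groups that fit are removed and the outer parentheses stay, so the depth has to be chosen as an upper bound.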
Upvotes: 6
Reputation: 6729
https://regex101.com/r/kQ2jS3/1
'(\(.*\))'
This captures from the first opening parenthesis to the furthest closing parenthesis, including everything in between.
Your old regex captures from the first opening parenthesis only up to the next closing parenthesis.
Upvotes: 1
Reputation: 148890
If you are sure that the parentheses are balanced and that the string contains only a single outermost group, just use the greedy version:
re.sub(r'\(.*\)', '', s2)
Upvotes: 1
Reputation: 784
re quantifiers are greedy, so they try to match as much text as possible. For the simple test case you mention, just let the regex run:
>>> re.sub(r'\(.*\)', '', 'stuff(remove(me))')
'stuff'
Upvotes: 1