Reputation: 1262
I'm trying to write a regex that removes text within parentheses () or square brackets [], except where the brackets contain only a number with a percent symbol, such as (30%). When brackets are nested, the removal should extend to the outermost closing bracket.
2.1.1. Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg.
What I have now removes everything between the brackets, but the match stops at the first closing bracket instead of the outermost one:
re.sub("[\(\[].*?[\)\]]", "", sentence).strip()
Upvotes: 1
Views: 141
Reputation: 626961
You can remove all substrings inside (possibly nested) square brackets, and all substrings inside parentheses except those that contain only a number followed by a percent symbol, with:
import re
def remove_text_nested(text, pattern):
    n = 1  # run at least once
    while n:
        text, n = re.subn(pattern, '', text)  # remove non-nested/flat balanced parts
    return text
text = "Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg."
text = remove_text_nested(text, r'\((?!\d+%\))[^()]*\)')
text = remove_text_nested(text, r'\[[^][]*]')
print(text)
Output:
Berlin is the capital and largest city of Germany by both area and population. Its 3,769,495 inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration. Many other immigrants came from Bohemia, Poland, and Salzburg.
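One caveat: each removal deletes only the bracketed group, so the space on either side of it stays. The string printed by the code above should therefore contain doubled spaces (for example where " (2019)[2] " used to be), and the output as displayed here appears to have had them collapsed. A minimal sketch of a cleanup step, assuming a hypothetical helper named collapse_spaces that is not part of the original answer:

```python
import re

def collapse_spaces(text):
    # Hypothetical helper (an assumption, not from the answer above):
    # squeeze every run of two or more spaces into a single space
    # and trim the ends.
    return re.sub(r' {2,}', ' ', text).strip()

raw = "Its 3,769,495  inhabitants make it the most populous city."
print(collapse_spaces(raw))
```

Calling collapse_spaces(text) just before print(text) would be one way to get single-spaced output.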
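Basically, the remove_text_nested function removes all matches in a loop until a pass makes no replacement. Each pass strips the innermost (flat) bracketed groups, so nested groups are peeled away from the inside out.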
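To see why the loop is needed, here is a small sketch that records the text after every pass. The name remove_text_nested_trace and the sample string are made up for illustration; the loop itself is the same re.subn loop as in the answer.

```python
import re

def remove_text_nested_trace(text, pattern):
    # Same loop as remove_text_nested, but keeps a copy of the text
    # after each pass that replaced something (illustrative helper).
    passes = []
    n = 1
    while n:
        text, n = re.subn(pattern, '', text)
        if n:
            passes.append(text)
    return text, passes

result, passes = remove_text_nested_trace("a (b (c (d))) e", r'\([^()]*\)')
for i, snapshot in enumerate(passes, 1):
    print(i, repr(snapshot))
# 1 'a (b (c )) e'
# 2 'a (b ) e'
# 3 'a  e'
```

Three levels of nesting take three replacing passes, plus a final pass that finds nothing and ends the loop.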
The \((?!\d+%\))[^()]*\) pattern matches a literal (, then fails the match if one or more digits followed by %) appear immediately to the right of the current location, then matches zero or more characters other than ( and ), and finally matches the closing ).
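A quick way to check what the negative lookahead protects is to test a few parenthesized strings against the pattern. The sample strings below are made up for illustration. Note that only a whole number directly followed by % is kept, so a decimal such as (30.5%) would still be removed.

```python
import re

PAREN = re.compile(r'\((?!\d+%\))[^()]*\)')

samples = ["(30%)", "(2019)", "(about 30%)", "(30.5%)", "(30 %)"]
for s in samples:
    # A full match means re.sub would delete the whole group.
    verdict = "removed" if PAREN.fullmatch(s) else "kept"
    print(s, "->", verdict)
# (30%) -> kept
# (2019) -> removed
# (about 30%) -> removed
# (30.5%) -> removed
# (30 %) -> removed
```

If decimals or a space before % should also be protected, the lookahead would need to be widened accordingly.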
The \[[^][]*] pattern simply matches a literal [, then zero or more characters other than [ and ], and then the closing ].
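The compact character class works because a ] placed first in a (negated) class is treated as a literal, and a ] outside a class needs no escape. As a small sketch with a made-up sample string, the compact pattern and a fully escaped spelling behave the same in a single substitution pass:

```python
import re

compact = r'\[[^][]*]'
escaped = r'\[[^\[\]]*\]'  # same pattern with every bracket escaped

s = "a[1][2] b[[x]y]"
once = re.sub(compact, '', s)
print(once)  # a b[y]
```

A single pass removes only the flat groups, which is why the outer [...y] survives here; running the pattern through remove_text_nested takes care of the remaining level.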
Upvotes: 3