Betafish

Reputation: 1262

Remove all bracketed text except for percentages

I'm trying to write a regex that removes text within parentheses () or square brackets [], except where the bracketed text is just a number followed by a percent symbol. When brackets are nested, the removal should extend to the outermost closing bracket.

2.1.1. Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg.

What I have now removes everything between the brackets, but it stops at the first closing bracket instead of the matching outermost one.

re.sub(r"[\(\[].*?[\)\]]", "", sentence).strip()

Upvotes: 1

Views: 141

Answers (1)

Wiktor Stribiżew

Reputation: 626961

You can remove all substrings inside (possibly nested) parentheses, except those that contain only a number followed by a percent symbol, and all substrings inside (possibly nested) square brackets, with:

import re

def remove_text_nested(text, pattern):
    n = 1  # run at least once
    while n:
        text, n = re.subn(pattern, '', text)  # remove non-nested/flat balanced parts
    return text

text = "Berlin (/bɜːrˈlɪn/; German: [bɛʁˈliːn] (About this soundlisten)) is the capital and largest city of Germany by both area and population.[5][6] Its 3,769,495 (2019)[2] inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration.[40] Many other immigrants came from Bohemia, Poland, and Salzburg."
text = remove_text_nested(text, r'\((?!\d+%\))[^()]*\)')
text = remove_text_nested(text, r'\[[^][]*]')
print(text)

Output:

Berlin  is the capital and largest city of Germany by both area and population. Its 3,769,495  inhabitants make it the most populous city proper of the European Union. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants. By 1700, approximately 30 percent (30%) of Berlin's residents were French, because of the Huguenot immigration. Many other immigrants came from Bohemia, Poland, and Salzburg.

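Basically, the remove_text_nested function repeats the substitution in a loop until no replacement occurs. Each pass removes the innermost bracketed parts, which exposes the next level of nesting for the following pass.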
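To see why the loop is needed, here is a minimal sketch that records the text after each pass. The trace_removal helper and the short sample string are illustrative assumptions, not part of the code above, and the pattern is a simplified one without the percentage check.

```python
import re

def trace_removal(text, pattern):
    """Hypothetical helper: same loop as remove_text_nested,
    but it keeps the text produced by every pass that changed something."""
    steps = []
    n = 1  # run at least once
    while n:
        text, n = re.subn(pattern, '', text)
        if n:  # only record passes that actually removed something
            steps.append(text)
    return steps

# Simplified pattern: a "(" followed by non-parenthesis chars and a ")"
steps = trace_removal("a (b (c) d) e", r'\([^()]*\)')
for step in steps:
    print(repr(step))
# 'a (b  d) e'   <- pass 1 removes the inner "(c)"
# 'a  e'         <- pass 2 removes the now-flat "(b  d)"
```

A single re.sub call would stop after the first line, because the outer pair only becomes a flat match once the inner pair is gone.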

The \((?!\d+%\))[^()]*\) pattern matches (, then fails the match if one or more digits followed by %) appear immediately to the right of the current location, then matches zero or more characters other than ( and ), and then matches ).
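A quick way to check what the lookahead protects is to test the pattern against a few short strings. The samples below are made up for illustration; only the pattern comes from the answer.

```python
import re

paren_pattern = re.compile(r'\((?!\d+%\))[^()]*\)')

# Illustrative samples (assumptions, not taken from the question text)
samples = ["(30%)", "(2019)", "(about 30%)", "(30% of residents)", "(12.5%)"]

# True means the pattern matches, i.e. the group would be removed
results = {s: bool(paren_pattern.fullmatch(s)) for s in samples}
for sample, removed in results.items():
    print(sample, '->', 'removed' if removed else 'kept')
```

Only a group consisting of digits directly followed by % is kept. Note that with this pattern a decimal percentage such as (12.5%) is still removed, since \d+ does not match the dot; if such values need to be preserved, the lookahead would have to be widened.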

The \[[^][]*] pattern simply matches [, then zero or more characters other than [ and ], and then a ].
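The compact [^][] character class can look odd at first: a ] placed right after the ^ is treated as a literal, so no backslashes are needed. As a small sketch, the fully escaped spelling below (my own rewrite, not from the answer) should behave the same on a sample string:

```python
import re

compact = re.compile(r'\[[^][]*]')        # pattern from the answer
explicit = re.compile(r'\[[^\[\]]*\]')    # same idea with every bracket escaped

# Short sample assembled from fragments of the question text
sample = "population.[5][6] German: [bɛʁˈliːn] x"

compact_result = compact.sub('', sample)
explicit_result = explicit.sub('', sample)
print(compact_result)
```

Which spelling to use is a matter of taste; the escaped form is longer but may be easier to read for people less familiar with character class rules.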

Upvotes: 3
