Reputation: 107
string: abc keyword1 ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd keyword2 abc
regex: re.compile(r'(?:keyword1)(.*)(?:keyword2)', flags = re.DOTALL | re.MULTILINE)
goal: exclude all digits from the match, except the ones within brackets
desired output: 'ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd'
approach1: Any digits within brackets are always 99, but the digits outside of brackets can also be 99. So I could exclude every digit except 99 from the match, and then remove the remaining 99s outside of brackets in a second, non-regex step?!
approach2: match ddd (basically everything, including the 99s) except all other digits, using some variant of the help below. I played around with (\([^)]*\)|\S)* but failed, probably because it's Java :D
Question: Which approach makes sense? How can I modify my regex to reach my goal?
related help
Exclude strings within parentheses from a regular expression?
(\([^)]*\)|\S)*
where one balanced set of parentheses is treated as if it were a single character, and so the regex as a whole matches a single word, where a word can contain these parenthesized groups.
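The quoted pattern is not Java-specific, and the same idea can be tried in Python. The sketch below is only an illustration of approach2, not the accepted solution: it uses a non-capturing variant of the pattern with `+` instead of `*` (an assumption made here so that `findall` returns whole tokens and no empty matches), splits the text into "words" in which a parenthesized group counts as one unit, and then drops the tokens that consist only of digits. The names `TOKEN`, `inner`, and `result` are hypothetical.

```python
import re

# Non-capturing variant of the quoted pattern: a token is a run of
# non-whitespace characters, where a (...) group counts as one unit.
TOKEN = re.compile(r'(?:\([^)]*\)|\S)+')

# Text between keyword1 and keyword2 from the question.
inner = " ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd "

# Split into tokens; "(ddd 99/ddd)" stays in one piece.
words = TOKEN.findall(inner)

# Drop tokens made up only of digits; digits inside brackets survive
# because their token also contains other characters.
result = " ".join(w for w in words if not w.isdigit())

print(result)  # ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd
```

Note that this only removes whole digit tokens. Digits glued to other characters outside brackets (for example `abc123`) would be kept, and nested brackets are not handled.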
Upvotes: 1
Views: 110
Reputation: 626929
Without any additional packages, you can use a two step approach: get the string between keywords and then remove all digit chunks that are not inside parentheses:
import re
s = "abc keyword1 ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd keyword2 abc"
m = re.search(r'keyword1(.*?)keyword2', s, re.I | re.S)
if m:
    print(re.sub(r'(\([^()]*\))|\s*\d+', r'\1', m.group(1)))
## => ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd
See the Python demo.
Notes:
keyword1(.*?)keyword2 extracts all contents between keyword1 and keyword2 into Group 1.
re.sub(r'(\([^()]*\))|\s*\d+', r'\1', m.group(1)) removes any digit chunks preceded by optional whitespace from the Group 1 value, while keeping all strings between ( and ) intact.
Upvotes: 2
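For reuse, the two steps from the answer could be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the function name `strip_digits_outside_parens` and its `start`/`end` parameters are hypothetical, the keywords are escaped with `re.escape` in case they contain regex metacharacters, and a replacement function is used instead of `r'\1'` so that an unmatched Group 1 is handled explicitly. The final `strip()` is also an addition, to drop the spaces left next to the keywords.

```python
import re

def strip_digits_outside_parens(text, start="keyword1", end="keyword2"):
    """Return the text between start and end with digit chunks removed,
    except for digits inside (...) groups. Returns None if no match."""
    # Step 1: grab everything between the two keywords (lazily).
    m = re.search(re.escape(start) + r'(.*?)' + re.escape(end),
                  text, re.I | re.S)
    if m is None:
        return None
    # Step 2: either capture a (...) group and put it back unchanged,
    # or match a digit chunk with optional leading whitespace and drop it.
    cleaned = re.sub(r'(\([^()]*\))|\s*\d+',
                     lambda mo: mo.group(1) or '',
                     m.group(1))
    return cleaned.strip()

s = "abc keyword1 ddd 111 ddd (ddd 99/ddd) 1 ddd (ddd) ddd 11 ddd keyword2 abc"
print(strip_digits_outside_parens(s))
# ddd ddd (ddd 99/ddd) ddd (ddd) ddd ddd
```

Because `[^()]*` cannot cross another bracket, only the innermost level of nested brackets is protected. Digits in an outer level of nesting would still be removed.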