Reputation: 720

Split out certain part of string with regex in python

I have different strings of the form _AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_ and I want to cut out the (SGYA) part (always capital letters in round brackets), along with any spaces directly before or after it. The result should be _AHDHDUHD[Tsfs]AHUDSHDI_.

My idea was to match the content of the square brackets with ([A-Z_])(\[.+\])([A-Z_]), then split the string and re-insert the cleaned part using the re module (although I am not sure which re function is suited for this).

However, this feels inelegant. Is there a regex that would do what I want directly, without the intermediary steps?

Upvotes: 0

Answers (5)

The fourth bird

Reputation: 163207

You could use two capturing groups and refer to both of them in the replacement with \1\2:

([A-Z_]+\[[^(\s]+)[^\S\r\n]*\([A-Z]+\)[^\S\r\n]*(\][A-Z_]+)

In parts

( Capture group 1
- [A-Z_]+ Match 1+ chars A-Z or _
- \[[^(\s]+ Match [ and 1+ chars other than ( or a whitespace char
) Close group
[^\S\r\n]* Match 0+ whitespace chars except newline
\([A-Z]+\) Match 1+ chars A-Z between parentheses
[^\S\r\n]* Match 0+ whitespace chars except newline
( Capture group 2
- \][A-Z_]+ Match ] and 1+ chars A-Z or _
) Close group

Regex demo | Python demo

For example

import re

regex = r"([A-Z_]+\[[^(\s]+)[^\S\r\n]*\([A-Z]+\)[^\S\r\n]*(\][A-Z_]+)"
test_str = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_"
print(re.sub(regex, r"\1\2", test_str))

Output

_AHDHDUHD[Tsfs]AHUDSHDI_
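
The part-by-part breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, assuming the same regex written with re.VERBOSE so that each piece carries its own comment; the names TAG_IN_BRACKETS and strip_tag are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is commented.
TAG_IN_BRACKETS = re.compile(
    r"""
    ([A-Z_]+\[[^(\s]+)   # group 1: A-Z/_ prefix, opening bracket, 1+ chars that are neither paren nor whitespace
    [^\S\r\n]*           # optional whitespace, newlines excluded
    \([A-Z]+\)           # the capitals in round brackets that get dropped
    [^\S\r\n]*           # optional whitespace, newlines excluded
    (\][A-Z_]+)          # group 2: closing bracket and A-Z/_ suffix
    """,
    re.VERBOSE,
)


def strip_tag(text):  # hypothetical helper name
    """Return text with the parenthesised capitals inside [...] removed."""
    return TAG_IN_BRACKETS.sub(r"\1\2", text)


print(strip_tag("_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_"))  # _AHDHDUHD[Tsfs]AHUDSHDI_
print(strip_tag("_AHDHDUHD[Tsfs]AHUDSHDI_"))         # no tag, returned unchanged
```

Compiling once is only a convenience here; calling re.sub with the pattern string and the re.VERBOSE flag should behave the same way.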

Upvotes: 0

accdias

Reputation: 5372

This will do what you want:

Python 3.7.5 (default, Oct 17 2019, 12:16:48) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> s='_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_'
>>> re.sub(r'(?:\s?\((.*)\))', '', s)
'_AHDHDUHD[Tsfs]AHUDSHDI_'
>>>

If you want to match only capital letters inside the parentheses, then the expression should be:

>>> re.sub(r'(?:\s?\(([A-Z]+)\))', '', s)
'_AHDHDUHD[Tsfs]AHUDSHDI_'
>>>
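
The two expressions above give the same result on the sample string, but they can differ on other inputs. The sketch below uses two made-up test strings (not from the question) to illustrate this: the greedy (.*) runs to the last closing parenthesis in the string, and \s? removes at most one whitespace character, whereas the question asks for all spaces directly around the tag to go.

```python
import re

two_tags = "_A[Ts (SG)]B[Uv (XY)]C_"  # hypothetical input with two tags
padded = "_A[Ts  (SG) ]B_"            # two spaces before the tag, one after

# Greedy .* spans from the first "(" to the last ")" in the string.
greedy = re.sub(r"\s?\((.*)\)", "", two_tags)
# [A-Z]+ cannot cross "]" or "[", so each tag is removed on its own.
strict = re.sub(r"\s?\(([A-Z]+)\)", "", two_tags)
# \s? strips a single leading whitespace char and nothing after the tag.
one_space = re.sub(r"\s?\([A-Z]+\)", "", padded)
# \s* on both sides strips all surrounding whitespace.
all_spaces = re.sub(r"\s*\([A-Z]+\)\s*", "", padded)

print(greedy)           # _A[Ts]C_
print(strict)           # _A[Ts]B[Uv]C_
print(repr(one_space))  # '_A[Ts  ]B_'
print(all_spaces)       # _A[Ts]B_
```

If the real data never contains more than one tag or more than one space, the shorter expressions are probably sufficient.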

I hope it helps.

Upvotes: 1

ChafikZ

Reputation: 23

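You are looking for the re.sub function:

import re
s = "AHDHDUHD[Tsfs (SGYA)]AHUDSHDI"
# re.sub replaces the whole match, so match only the part to remove:
# the parenthesised text plus any whitespace around it.
s_re = re.sub(r"\s*\(.*?\)\s*", "", s)
print(s_re)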

It will print:

AHDHDUHD[Tsfs]AHUDSHDI

Upvotes: 0

Wiktor Stribiżew

Reputation: 626728

You may use

re.sub(r'(\[[^][]*?)\s*\([A-Z]*\)\s*([^][]*])', r'\1\2', text)

See the regex demo

Details

(\[[^][]*?) - Group 1: a [ and then any 0+ chars other than [ and ] as few as possible
\s* - 0+ whitespaces
\( - a ( char
[A-Z]* - 0+ uppercase ASCII letters
\) - a ) char
\s* - 0+ whitespaces
([^][]*]) - Group 2: any 0+ chars other than ] and [ (as many as possible) and then a ]

Python demo:

import re
rx = r"(\[[^][]*?)\s*\([A-Z]*\)\s*([^][]*])"
s = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI"
print( re.sub(rx, r'\1\2', s) )
# => _AHDHDUHD[Tsfs]AHUDSHDI

Another idea: only remove all \s*\([A-Z]+\)\s* matches when found inside [...] substrings:

import re
s = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI"
print( re.sub(r"\[[^][]+]", lambda x: re.sub(r'\s*\([A-Z]+\)\s*', "", x.group()), s) )
# => _AHDHDUHD[Tsfs]AHUDSHDI

See another Python demo.

Here, the \[[^][]+] pattern finds every chunk that starts with [, continues with 1+ chars other than square brackets, and ends with ]. Only inside those matches is each occurrence of 0+ whitespaces, (, 1+ uppercase ASCII letters, ) and 0+ whitespaces removed.
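
To see what restricting the removal to [...] substrings buys, here is a small sketch on a made-up string that also has parenthesised capitals outside the square brackets. The input and the variable names are assumptions for illustration only.

```python
import re

s = "X (AB) Y[Tsfs (SGYA)]Z (CD)"  # hypothetical input with tags inside and outside [...]
TAG = r"\s*\([A-Z]+\)\s*"

# Plain substitution removes every tag, wherever it occurs.
everywhere = re.sub(TAG, "", s)

# The callback runs the inner substitution only on each [...] chunk.
inside_only = re.sub(r"\[[^][]+]", lambda m: re.sub(TAG, "", m.group()), s)

print(everywhere)   # XY[Tsfs]Z
print(inside_only)  # X (AB) Y[Tsfs]Z (CD)
```

The nested re.sub costs one extra pass per bracketed chunk, which should be negligible for strings of this size.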

Upvotes: 1

Morne

Reputation: 1743

import re


weirdstring = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_"
weirdstring = re.sub(r'(.*?)(\s*\(.*?\)\s*)(.*?)', r'\1\3', weirdstring)

print(weirdstring)

# prints _AHDHDUHD[Tsfs]AHUDSHDI_
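
A short sketch of why the \1\3 replacement matters with this pattern: re.sub replaces the whole match, and the lazy first group is part of that match, so whatever it captured has to be written back. The variable names are illustrative; the last line shows a simpler pattern that appears to give the same result on this input without any groups.

```python
import re

s = "_AHDHDUHD[Tsfs (SGYA)]AHUDSHDI_"
grouped = r"(.*?)(\s*\(.*?\)\s*)(.*?)"

# The first group swallows everything before the tag; the third matches nothing.
first = re.search(grouped, s)
print(first.groups())  # ('_AHDHDUHD[Tsfs', ' (SGYA)', '')

restored = re.sub(grouped, r"\1\3", s)      # groups 1 and 3 written back
dropped = re.sub(grouped, "", s)            # whole match deleted, prefix included
simpler = re.sub(r"\s*\(.*?\)\s*", "", s)   # match only the part to remove

print(restored)  # _AHDHDUHD[Tsfs]AHUDSHDI_
print(dropped)   # ]AHUDSHDI_
print(simpler)   # _AHDHDUHD[Tsfs]AHUDSHDI_
```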

Upvotes: 1
