Reputation: 733
I'm working on my regex skills, and I found that some of my strings have a duplicated word at the start. I would like to remove the duplicates and keep just one copy of the word:
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the regex below, but I still get both server_server in my output:
((.*?))_(?!\D)
How can I reduce the output to a single server_ when there are two or more copies, and take the string as is when there is only one server_? The output doesn't have to contain the digits or the part after the dot (.zzz, .xyz, etc.).
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
Upvotes: 3
Views: 1428
Reputation: 140148
You could use a backreference to the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
Add the "one or more times" quantifier ({1,}) so it still works when there are more than 2 occurrences:
>>> s = "server_server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
Getting rid of the suffix is not the hard part; capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
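One caveat with the pattern above: \1{1,} requires at least one repeat, so a name with a single prefix such as server_dev1_1233.zzz does not match and comes back unchanged, digits and extension included. Below is a minimal two-step sketch that handles both cases, assuming every name ends in _<digits>.<extension>. The helper name clean_name is hypothetical.

```python
import re

def clean_name(s):
    # Step 1: drop the trailing "_<digits>.<extension>" part.
    s = re.sub(r"_\d+\.\w+$", "", s)
    # Step 2: collapse a repeated leading "word_" into a single copy.
    # With a single prefix there is no match, so the string is kept as is.
    return re.sub(r"^([^_]+_)\1+", r"\1", s)

names = [
    "server_server_dev1_check_1233.zzz",
    "server_server_qa1_run_1233.xyz",
    "server_server_dev2_1233.qqa",
    "server_dev1_1233.zzz",
    "data_data_dev9_check_660.log",
]
for n in names:
    print(clean_name(n))
```

Splitting the work into two substitutions keeps each pattern short, and anchoring the second one with ^([^_]+_) stops the greedy (.*_) from latching onto a longer prefix.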
Upvotes: 4
Reputation: 626699
You may use a single re.sub call to match and remove what you do not need, and to match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed by the same substring as in Group 1 (thanks to the inline backreference \1, which retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do: 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Groups 1 and 2 are concatenated and make up the resulting value.
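To see the breakdown in action, here is a sketch of the same pattern written with re.VERBOSE, so each piece carries its own comment, followed by a look at what the two groups capture. The variable names rx_verbose and m are only illustrative.

```python
import re

# Same pattern as above; re.VERBOSE ignores the whitespace and the # comments.
rx_verbose = re.compile(r"""
    ^            # start of string
    ([^_]+)      # Group 1: one or more chars other than "_"
    (?:_\1)*     # zero or more repeats of "_" plus the Group 1 text
    (.*)         # Group 2: any chars, as many as possible
    _\d+         # an underscore followed by one or more digits
    \.\w+        # a dot followed by one or more word chars
    $            # end of string
""", re.VERBOSE)

m = rx_verbose.match("server_server_dev1_check_1233.zzz")
print(repr(m.group(1)))  # 'server'
print(repr(m.group(2)))  # '_dev1_check'
```

Concatenating the two groups gives server_dev1_check, which is what the \1\2 replacement produces.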
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
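Note that re.sub returns the input unchanged when the pattern does not match, so a name in an unexpected format would pass through silently. A small sketch that makes that case explicit by matching first and building the result from the groups; the function name dedupe_prefix and the choice to return None are illustrative assumptions.

```python
import re

RX = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')

def dedupe_prefix(s):
    # Return the cleaned name, or None when s does not fit the expected format.
    m = RX.match(s)
    if m is None:
        return None
    # Group 1 is the prefix word, Group 2 is everything up to "_<digits>.<ext>".
    return m.group(1) + m.group(2)

print(dedupe_prefix("data_data_dev9_check_660.log"))
print(dedupe_prefix("not_a_matching_name"))
```

Whether to return None, raise, or keep the original string depends on how strict the caller needs to be.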
Upvotes: 3