sdgd

Reputation: 733

Remove duplicate words in a string using regex

I'm working on my regex skills, and some of my strings have a duplicated word at the start. I would like to remove the duplicates and keep just one copy of the word:

server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log

I used the regex below, but I still get both words (server_server) in my output:

((.*?))_(?!\D)

How can I reduce the output to a single server_ when there are two or more, and keep it as is when there is only one? The output doesn't need to contain the trailing digits or the part after the dot, i.e. .zzz, .xyz, etc.

Expected output -

server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check

Upvotes: 3

Views: 1428

Answers (2)

Jean-François Fabre

Reputation: 140148

You could use a backreference to the word in your search expression:

>>> import re
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'

Then put a "one or more" quantifier on the backreference so that it still works when the word occurs more than twice:

>>> s = "server_server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'

Getting rid of the suffix is the easy part: capture the rest, and leave the trailing digits and extension out of the replacement:

>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
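
As a quick check, the sketch below runs that last pattern over all five sample strings from the question. The names `pattern`, `samples` and `results` are only illustrative. Note that because `\1{1,}` requires at least one repeat, a string with a single `server_` prefix does not match at all and comes back unchanged, suffix included.

```python
import re

# Pattern from the step above: repeated prefix, middle part, then _digits.extension
pattern = r"(.*_)\1{1,}(.*)(_\d+\..*)"

samples = [
    "server_server_dev1_check_1233.zzz",
    "server_server_qa1_run_1233.xyz",
    "server_server_dev2_1233.qqa",
    "server_dev1_1233.zzz",  # no duplicated prefix: the pattern does not match
    "data_data_dev9_check_660.log",
]

# Keep group 1 (one copy of the prefix) and group 2 (the middle part)
results = [re.sub(pattern, r"\1\2", s) for s in samples]

for before, after in zip(samples, results):
    print(f"{before} -> {after}")
```

If the single-prefix case matters, the repeat has to be optional, which is what the other answer's `(?:_\1)*` does.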

Upvotes: 4

Wiktor Stribiżew

Reputation: 626699

You may use a single re.sub call that matches the whole string, captures the parts you need, and leaves the parts you do not need out of the replacement:

re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)

Details

  • ^ - start of string
  • ([^_]+) - Capturing group 1: any 1+ chars other than _
  • (?:_\1)* - zero or more repetitions of _ followed by the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
  • (.*) - Group 2: any 0+ chars, as many as possible
  • _ - an underscore
  • \d+ - 1+ digits
  • \. - a dot
  • \w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
  • $ - end of string.

The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
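
To see what each group holds, here is a minimal sketch that inspects the match object for one of the sample strings. The variable names `m`, `prefix` and `middle` are only illustrative.

```python
import re

rx = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')

m = rx.match("server_server_dev1_check_1233.zzz")

# Group 1 is the first word; the repeated "_server" is consumed by (?:_\1)*
prefix = m.group(1)
# Group 2 is everything between the repeats and the final "_digits.ext" part
middle = m.group(2)

print(prefix)           # server
print(middle)           # _dev1_check
print(prefix + middle)  # server_dev1_check
```

Group 2 keeps its leading underscore, which is why concatenating the two groups needs no extra separator.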

Python demo:

import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
    print(re.sub(rx, r'\1\2', s))

Output:

server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
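
One edge case that may be worth checking: `(?:_\1)*` also consumes a repeat that is only the start of a longer word, so `server_serverx_...` collapses to `serverx_...`. The question does not say whether such inputs occur, so the variant below is only a sketch under the assumption that a repeat should count only when it is a whole underscore-delimited word. The lookahead `(?=_)` and the names `rx_strict` and `strip_repeats` are additions for illustration, not part of the original answer.

```python
import re

# Original pattern from the answer
rx = re.compile(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$')
# Assumed stricter variant: a repeat must be followed by an underscore
rx_strict = re.compile(r'^([^_]+)(?:_\1(?=_))*(.*)_\d+\.\w+$')

def strip_repeats(s, pattern=rx_strict):
    """Collapse a repeated leading word and drop the _digits.ext suffix."""
    return pattern.sub(r'\1\2', s)

tricky = "server_serverx_dev1_1233.zzz"
print(strip_repeats(tricky, rx))         # serverx_dev1
print(strip_repeats(tricky, rx_strict))  # server_serverx_dev1

# The sample strings from the question behave the same with both patterns
print(strip_repeats("server_server_dev1_check_1233.zzz"))  # server_dev1_check
```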

Upvotes: 3
