nikinlpds
nikinlpds

Reputation: 411

Replace previous occurrence of string

I want to remove duplicated words inside brackets and replace them with "S" + word.

For eg:

(Skipper Skipper) -> (S Skipper)
('s 's) -> (S 's)

Here is the string, s:

s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"

Expected result:

out = "(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own)))))) 
       (S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"

I tried to do:

from collections import Counter

lst = s.lstrip("(").rstrip(")").replace("(", "").replace(")", "").split()
d = Counter(lst)
mapper = {((k + " ") * v).strip():"S" + " " + k for k, v in d.items()}
for k, v in mapper.items():
    out = s.replace(k, v)

But not getting quite right:

out = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (it it) (S (S signed) (S (a a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (for for) (S (S (a a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) (S (S does) (S (S n't) (S own)))))) 
       (S (for for) (S (S (S 11.50) (S (a a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"                                                                                                                                                                               

Upvotes: 1

Views: 93

Answers (4)

alex_noname
alex_noname

Reputation: 32153

You can use re.sub and backreferences in regular expression.

For finding duplicate words you can use \1 that references the captured match of the first group, and \g<1> to reference it in repl argument. Like so:

res = re.sub(r"([\w.'%]+)\s+\1", r"S \g<1>", s)

Upvotes: 1

Alexander Riedel
Alexander Riedel

Reputation: 1359

There's this solution iterating through the list of words, finding duplicates and replacing the first occurency of each duplicate wirh "S"

s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""

word_list = s.split()

for word, next_word in zip(word_list, word_list[1:]):
    if word.replace('(', '').replace(')', '') == next_word.replace('(', '').replace(')', ''):
        word_list[word_list.index(word)] = "(S"
        

s_new = " ".join(word_list)

Upvotes: 1

jizhihaoSAMA
jizhihaoSAMA

Reputation: 12672

Use re.sub to replace them:

import re

def sub(matched):
    return f"(S {matched.group(2)})" if matched.group(1) == matched.group(2) else str(matched.groups())

s = '''(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))'''

result = re.sub(r"\(([\.\%\'\w\d]+) ([\.\%\'\w\d]+)\)", sub, s)

Upvotes: 1

AnsFourtyTwo
AnsFourtyTwo

Reputation: 2518

You might want to look into regular expressions here. I've created a demo which will match all inner brackets.

Having those, you can analyize the content for each of those matches and replace it according to your requirements:

import re

s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) \
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) \
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) \
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) \
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) \
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) \
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) \
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"

# Finding all inner brackets:
# - (Skipper Skipper)
# - ('s 's)
# - etc.
double_words = re.findall(r"(\((?:\(??[^\(]*?\)))", s)


for double_word in double_words:
    words = double_word.lstrip("(").rstrip(")").split()
    # First and second word are the same
    if words[0]==words[1]:
        # Replace ('s 's) with (S 's)
        s = s.replace(double_word, f'(S {words[0]})')
        
print(s)

Output

(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.)))      (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive)      (S (S merger) (S (S agreement) (S (S for) (S (S (S a)      (S (S National) (S (S Pizza) (S (S Corp.) (S unit)))))      (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %)))      (S (S (S of) (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it)      (S (S does) (S (S n't) (S own)))))) (S (S for) (S (S (S 11.50)      (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))

Upvotes: 1

Related Questions