Reputation: 411
I want to remove duplicated words inside brackets and replace them with "S" + word.
For example:
(Skipper Skipper) -> (S Skipper)
('s 's) -> (S 's)
Here is the string, s:
s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.)))
(S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive)
(S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a)
(S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit)))))
(S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %)))
(S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it)
(S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50)
(S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""
Expected result:
out = "(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.)))
(S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger)
(S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.)
(S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of)
(S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own))))))
(S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"
I tried this:
from collections import Counter
lst = s.lstrip("(").rstrip(")").replace("(", "").replace(")", "").split()
d = Counter(lst)
mapper = {((k + " ") * v).strip():"S" + " " + k for k, v in d.items()}
for k, v in mapper.items():
    out = s.replace(k, v)
But the result is not quite right:
out = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (S Bellevue) (S Wash.)))
(S (S said) (S (it it) (S (S signed) (S (a a) (S (S definitive) (S (S merger)
(S (S agreement) (S (for for) (S (S (a a) (S (S National) (S (S Pizza) (S (S Corp.)
(S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of)
(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) (S (S does) (S (S n't) (S own))))))
(S (for for) (S (S (S 11.50) (S (a a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"
Upvotes: 1
Views: 93
Reputation: 32153
You can use re.sub with backreferences in the regular expression.
To find duplicated words, use \1 in the pattern; it refers to the text captured by the first group. In the repl argument, use \g<1> to refer to the same group. Like so:
import re
res = re.sub(r"([\w.'%]+)\s+\1", r"S \g<1>", s)
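One caveat worth sketching: the pattern above is not anchored to the brackets, so it can also match a word followed by a longer word that merely starts with it. The variant below is only an illustration, assuming every duplicated pair sits directly inside its own pair of parentheses; the names DUP_PAIR and mark_duplicates are made up for this example.

```python
import re

# Hypothetical stricter pattern: "(" word, whitespace, the same word, ")".
# [^\s()]+ takes any run of characters that are not whitespace or brackets.
DUP_PAIR = re.compile(r"\(([^\s()]+)\s+\1\)")

def mark_duplicates(text):
    # Rewrite "(word word)" as "(S word)"; everything else is left alone.
    return DUP_PAIR.sub(r"(S \1)", text)

sample = "(S (S (Skipper Skipper) ('s 's)) (a apple))"
strict = mark_duplicates(sample)
print(strict)  # (S (S (S Skipper) (S 's)) (a apple))

# The unanchored pattern treats "a a" at the start of "a apple" as a duplicate.
loose = re.sub(r"([\w.'%]+)\s+\1", r"S \g<1>", "(a apple)")
print(loose)  # (S apple)
```

On the string from the question both patterns should give the same result, since every two-word bracket there is a true duplicate.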
Upvotes: 1
Reputation: 1359
Here is a solution that iterates through the list of words, finds duplicates, and replaces the first occurrence of each duplicate with "(S":
s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.)))
(S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive)
(S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a)
(S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit)))))
(S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %)))
(S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it)
(S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50)
(S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""
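word_list = s.split()
# Compare each token with the one after it, ignoring brackets
for i, (word, next_word) in enumerate(zip(word_list, word_list[1:])):
    if word.replace('(', '').replace(')', '') == next_word.replace('(', '').replace(')', ''):
        # Overwrite the first token of the pair, e.g. "(Skipper" becomes "(S"
        word_list[i] = "(S"
s_new = " ".join(word_list)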
Upvotes: 1
Reputation: 12672
Use re.sub to replace them:
import re
def sub(matched):
    # Collapse "(word word)" to "(S word)"; leave any other pair unchanged
    return f"(S {matched.group(2)})" if matched.group(1) == matched.group(2) else matched.group(0)
s = '''(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.)))
(S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive)
(S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a)
(S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit)))))
(S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %)))
(S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it)
(S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50)
(S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))'''
result = re.sub(r"\(([\.\%\'\w\d]+) ([\.\%\'\w\d]+)\)", sub, s)
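To see how a replacement function behaves on brackets that do not hold a duplicate, here is a small check on a made-up input. It is a sketch rather than the code above: the names PAIR and collapse, the simplified character class, and the sample string are all assumptions for illustration.

```python
import re

# Hypothetical pattern: two bracket-free tokens separated by one space.
PAIR = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def collapse(match):
    first, second = match.groups()
    if first == second:
        return f"(S {second})"
    # Not a duplicate: return the matched text unchanged.
    return match.group(0)

sample = "(S (it it) (S (does does) (not same)))"
result = PAIR.sub(collapse, sample)
print(result)  # (S (S it) (S (S does) (not same)))
```

Returning match.group(0) in the fallback branch keeps non-duplicate pairs such as (not same) exactly as they were.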
Upvotes: 1
Reputation: 2518
You might want to look into regular expressions here. I've created a demo which will match all inner brackets.
With those matches, you can analyze the content of each one and replace it according to your requirements:
import re
s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) \
(S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) \
(S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) \
(S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) \
(S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) \
(S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) \
(S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) \
(S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"
# Finding all inner brackets:
# - (Skipper Skipper)
# - ('s 's)
# - etc.
double_words = re.findall(r"(\((?:\(??[^\(]*?\)))", s)
for double_word in double_words:
    words = double_word.lstrip("(").rstrip(")").split()
    # First and second word are the same
    if words[0]==words[1]:
        # Replace ('s 's) with (S 's)
        s = s.replace(double_word, f'(S {words[0]})')
print(s)
(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.))) (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger) (S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.) (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own)))))) (S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))
Upvotes: 1