Reputation: 411
I want to remove duplicated words inside brackets and replace them with "S" + word. The words can be anything - including special characters, decimals, time-periods, hyphenated words etc.
For example:
(Skipper Skipper) -> (S Skipper)
('s 's) -> (S 's)
Here is the string, s:
s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.)))
(S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive)
(S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a)
(S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit)))))
(S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %)))
(S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it)
(S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50)
(S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"
Expected result:
out = "(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.)))
(S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger)
(S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.)
(S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of)
(S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own))))))
(S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"
I tried to do:
import re
def sub(matched):
    # Collapse "(word word)" to "(S word)"; leave non-duplicate pairs unchanged
    return f"(S {matched.group(2)})" if matched.group(1) == matched.group(2) else matched.group(0)
result = re.sub(r"\(([\.\%\'\w\d]+) ([\.\%\'\w\d]+)\)", sub, s)
But this way I have to list every character type (\d, \w, etc.) explicitly. Is there a one-shot way to achieve this?
Upvotes: 2
Views: 55
Reputation: 20737
This would do it:
\(([^()]+?) +\1\)
and your substitution would be (S \1)
https://regex101.com/r/3CUxC6/1
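For reference, here is a minimal Python sketch of that pattern and substitution. The helper name collapse and the short sample strings are illustrative assumptions, not part of the question.

```python
import re

# Pattern and replacement as given in the answer above.
pattern = re.compile(r'\(([^()]+?) +\1\)')

def collapse(text):
    # Replace "(word word)" with "(S word)".
    return pattern.sub(r'(S \1)', text)

sample = "(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))"
print(collapse(sample))

# [^()] also accepts spaces, so a repeated multi-word phrase is collapsed too.
print(collapse("(New York New York)"))

# The separator is a literal space, so copies split by a line break are left alone.
print(collapse("(90.6\n90.6)"))
```

If the two copies can be separated by tabs or line breaks, swapping the literal space for \s+ would cover that case.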
Upvotes: 0
Reputation: 61910
As you want to match duplicates inside parentheses, you could do:
import re
s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.)))
(S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive)
(S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a)
(S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit)))))
(S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %)))
(S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it)
(S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50)
(S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""
res = re.sub(r'\((\S+)\s+\1\)', r'(S \1)', s)
print(res)
Output
(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.)))
(S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive)
(S (S merger) (S (S agreement) (S (S for) (S (S (S a)
(S (S National) (S (S Pizza) (S (S Corp.) (S unit)))))
(S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %)))
(S (S (S of) (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it)
(S (S does) (S (S n't) (S own)))))) (S (S for) (S (S (S 11.50)
(S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))
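As a quick check on a shorter input, here is a minimal sketch using the same pattern. The sample string and the variable names are illustrative assumptions.

```python
import re

# Same pattern and replacement as in the snippet above.
pattern = re.compile(r'\((\S+)\s+\1\)')

sample = "(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))"
short_res = pattern.sub(r'(S \1)', sample)
print(short_res)

# \s+ matches any whitespace, so copies separated by a line break are also collapsed.
wrapped_res = pattern.sub(r'(S \1)', "(90.6\n90.6)")
print(wrapped_res)
```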
The regex:
\( - matches an open parenthesis
(\S+) - matches a group of one or more non-whitespace characters (and puts them in a capture group)
\s+ - matches one or more whitespace characters
\1 - a back-reference to the first capture group, forcing a match of the exact same text
\) - matches a close parenthesis
Upvotes: 0
Reputation: 626845
You can use
(?<![^\s()])([^\s()]+)(?=\s+\1(?![^\s()]))
See the regex demo. Details:
(?<![^\s()]) - a negative lookbehind that matches a location that is not immediately preceded by a char other than whitespace, ( and )
([^\s()]+) - Group 1: one or more chars other than whitespace, ( and )
(?=\s+\1(?![^\s()])) - a positive lookahead that matches a location that is immediately followed by:
    \s+ - one or more whitespace chars
    \1 - the Group 1 value
    (?![^\s()]) - a negative lookahead: there must be no char other than whitespace, ( and ) immediately to the right of the current location.
In Python, use
re.sub(r'(?<![^\s()])([^\s()]+)(?=\s+\1(?![^\s()]))', 'S', text)
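A minimal runnable sketch of that call follows. The helper name mark_duplicates and the sample strings are illustrative assumptions. Because the pattern relies only on lookarounds, it replaces just the first copy of the word and does not require the surrounding parentheses.

```python
import re

# Pattern as given in the answer above; only the first copy of the word is consumed.
pattern = re.compile(r'(?<![^\s()])([^\s()]+)(?=\s+\1(?![^\s()]))')

def mark_duplicates(text):
    # Replace the first of two identical adjacent tokens with "S".
    return pattern.sub('S', text)

sample = "(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))"
print(mark_duplicates(sample))

# The trailing negative lookahead stops a prefix match: "it" is not a duplicate of "item".
print(mark_duplicates("(it item)"))

# No parentheses are required, so a bare repeated word is rewritten as well.
print(mark_duplicates("said said"))
```

If duplicates should only be rewritten when they sit directly inside a pair of parentheses, one of the patterns that matches \( and \) explicitly may be the safer choice.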
Upvotes: 1