nikinlpds

Reputation: 411

Replace previous occurrence of word

I want to remove duplicated words inside brackets and replace them with "S" + word. The words can be anything - including special characters, decimals, time-periods, hyphenated words etc.

For example:

(Skipper Skipper) -> (S Skipper)
('s 's) -> (S 's)

Here is the string, s:

s = "(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"

Expected result:

out = "(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.))) 
       (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) (S (S merger) 
       (S (S agreement) (S (S for) (S (S (S a) (S (S National) (S (S Pizza) (S (S Corp.) 
       (S unit))))) (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) (S (S (S of) 
       (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) (S (S does) (S (S n't) (S own)))))) 
       (S (S for) (S (S (S 11.50) (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))"

I tried to do:

import re

def sub(matched):
    # Collapse the pair only when both words are identical; otherwise keep the matched text unchanged.
    return f"(S {matched.group(2)})" if matched.group(1) == matched.group(2) else matched.group(0)

result = re.sub(r"\(([\.\%\'\w\d]+) ([\.\%\'\w\d]+)\)", sub, s)

But this requires me to list every character type (\d, \w, etc.) explicitly in the character class. Is there a one-shot way to achieve this?

Upvotes: 2

Views: 55

Answers (3)

MonkeyZeus

Reputation: 20737

This would do it:

\(([^()]+?) +\1\)

and your substitution would be (S \1)

https://regex101.com/r/3CUxC6/1
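For use from Python, a minimal sketch of the same pattern and substitution might look like the following. The shortened `sample` string and the variable names are only illustrative and are not part of the original question.

```python
import re

# Pattern from this answer: "(", a lazily matched run containing no
# parentheses, one or more spaces, the same run again, then ")".
pattern = re.compile(r"\(([^()]+?) +\1\)")

# Shortened sample input (illustrative; the real string is much longer).
sample = "(S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))"

# \1 in the replacement re-inserts the captured word after the "S" label.
result = pattern.sub(r"(S \1)", sample)
print(result)  # (S (S (S Skipper) (S 's)) (S Inc.))
```

Note that ` +` only matches literal spaces, so a pair split across a line break would not be collapsed; in the string from the question every pair sits on one line.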

Upvotes: 0

Dani Mesejo

Reputation: 61910

As you want to match duplicates inside parentheses, you could do:

import re

s = """(S (S (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.)) (S (Bellevue Bellevue) (Wash. Wash.))) 
     (S (said said) (S (it it) (S (signed signed) (S (a a) (S (definitive definitive) 
     (S (merger merger) (S (agreement agreement) (S (for for) (S (S (a a) 
     (S (National National) (S (Pizza Pizza) (S (Corp. Corp.) (unit unit))))) 
     (S (to to) (S (acquire acquire) (S (S (S (the the) (S (90.6 90.6) (% %))) 
     (S (S (of of) (S (S (Skipper Skipper) ('s 's)) (Inc. Inc.))) (S (it it) 
     (S (does does) (S (n't n't) (own own)))))) (S (for for) (S (S (11.50 11.50) 
     (S (a a) (share share))) (S (or or) (S (about about) (S (28.1 28.1) (million million)))))))))))))))))))"""

res = re.sub(r'\((\S+)\s+\1\)', r'(S \1)', s)
print(res)

Output

(S (S (S (S (S Skipper) (S 's)) (S Inc.)) (S (S Bellevue) (S Wash.))) 
     (S (S said) (S (S it) (S (S signed) (S (S a) (S (S definitive) 
     (S (S merger) (S (S agreement) (S (S for) (S (S (S a) 
     (S (S National) (S (S Pizza) (S (S Corp.) (S unit))))) 
     (S (S to) (S (S acquire) (S (S (S (S the) (S (S 90.6) (S %))) 
     (S (S (S of) (S (S (S Skipper) (S 's)) (S Inc.))) (S (S it) 
     (S (S does) (S (S n't) (S own)))))) (S (S for) (S (S (S 11.50) 
     (S (S a) (S share))) (S (S or) (S (S about) (S (S 28.1) (S million)))))))))))))))))))

The regex:

  • \( matches an open parenthesis
  • (\S+) matches a group of one or more non white-spaces (puts them in a capture group)
  • \s+ matches one or more white-spaces
  • \1 a back-reference to the first capture group forcing to match the exact same text
  • \) matches a close parenthesis
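As a quick check of the pieces listed above, here is a small sketch that runs the same pattern over a few short inputs. The helper name `collapse` and the sample strings are made up for illustration.

```python
import re

# Same pattern as above: "(", a captured run of non-whitespace,
# whitespace, the identical run again (back-reference), then ")".
DUPLICATE = re.compile(r"\((\S+)\s+\1\)")

def collapse(text):
    """Replace every "(word word)" pair with "(S word)"."""
    return DUPLICATE.sub(r"(S \1)", text)

# Hypothetical sample inputs covering decimals, symbols and apostrophes.
cases = ["(90.6 90.6)", "(% %)", "(n't n't)", "(S (a a) (share share))", "(foo bar)"]
for case in cases:
    print(case, "->", collapse(case))
```

The last case, `(foo bar)`, is left untouched because the back-reference only matches when the second word is identical to the first.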

Upvotes: 0

Wiktor Stribiżew

Reputation: 626845

You can use

(?<![^\s()])([^\s()]+)(?=\s+\1(?![^\s()]))

See the regex demo. Details:

  • (?<![^\s()]) - a negative lookbehind that matches a location that is not immediately preceded by a char other than a whitespace, ( and )
  • ([^\s()]+) - Group 1: one or more chars other than a whitespace, ( and )
  • (?=\s+\1(?![^\s()])) - a positive lookahead that matches a location that is immediately followed with
    • \s+ - 1 or more whitespaces
    • \1 - Group 1 value
    • (?![^\s()]) - there must be no char other than a whitespace, ( and ) immediately to the right of the current location.

In Python, use

re.sub(r'(?<![^\s()])([^\s()]+)(?=\s+\1(?![^\s()]))', 'S', text)
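A small sketch of that call on a shortened input follows; the `text` value and variable names are illustrative only. Only the first word of each pair is consumed by the match, so replacing it with `S` leaves the second copy in place.

```python
import re

# The first word of a duplicated pair is matched; the second copy is only
# checked by the lookahead and therefore stays in the output.
pattern = re.compile(r"(?<![^\s()])([^\s()]+)(?=\s+\1(?![^\s()]))")

text = "(S (the the) (S (90.6 90.6) (% %)))"  # shortened sample input
result = pattern.sub("S", text)
print(result)  # (S (S the) (S (S 90.6) (S %)))

# The trailing negative lookahead prevents partial-word matches:
# "the" followed by "them" is not treated as a duplicate.
unchanged = pattern.sub("S", "(the them)")
print(unchanged)  # (the them)
```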

Upvotes: 1
