Xiaoyi Zhang

Reputation: 37

How to use regex to only keep first n repeated words

I have an input sentence

input = 'ok ok, it is very very very very very hard'

and I want to keep only the first three occurrences of any consecutively repeated word:

output = 'ok ok, it is very very very hard'

How can I achieve this with the re or regex module in Python?

Upvotes: 0

Views: 80

Answers (3)

The fourth bird

Reputation: 163352

One option could be to use a capturing group with a backreference and use that in the replacement.

((\w+)(?: \2){2})(?: \2)*

Explanation

  • ( Capture group 1
    • (\w+) Capture group 2, match 1+ word characters. (The example data only uses word characters. To make sure the match is not part of a larger word, add a word boundary \b.)
    • (?: \2){2} Match a space followed by a backreference to group 2, repeated 2 times. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs, or \s+ to match 1+ whitespace characters (note that \s also matches a newline).
  • ) Close group 1
  • (?: \2)* Match a space followed by a backreference to group 2, 0+ times. These are the extra repetitions that the replacement removes.


For example

import re

regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)

print(result)

Result

ok ok, it is very very very hard
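
The explanation mentions word boundaries and \s+ as optional refinements. A minimal sketch that applies both and makes the number of kept words configurable could look like this. The helper name keep_first_n and its n parameter are made up for illustration and are not part of the answer above.

```python
import re

def keep_first_n(text, n=3):
    """Hypothetical helper: keep at most n consecutive copies of a word."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 2 is the word, group 1 is the word plus its next n-1 repeats.
    # The trailing \b after each backreference stops "very" from matching
    # the start of a longer word such as "veryfine".
    pattern = r"\b((\w+)(?:\s+\2\b){%d})(?:\s+\2\b)*" % (n - 1)
    return re.sub(pattern, r"\1", text)

s = "ok ok, it is very very very very very hard"
print(keep_first_n(s))        # ok ok, it is very very very hard
print(keep_first_n(s, n=1))   # ok, it is very hard
```

Without the word boundaries, an input like "very very very veryfine" would lose the "very" prefix of the last word, because the pattern without \b happily matches a partial word.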

Upvotes: 1

Andrej Kesely

Reputation: 195438

One solution with re.sub with custom function:

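import re

s = 'ok ok, it is very very very very very hard'

def replace(n=3):
    # Generator used as a stateful callback: each matched word is passed in
    # with send(), and the yielded value is the text that replaces it.
    last_word, cnt = '', 0
    current_word = yield

    while True:
        # Count how many times in a row the same match has been seen.
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0

        last_word = current_word

        # Drop the word once it has already appeared n times in a row.
        if cnt >= n:
            current_word = yield ''
        else:
            current_word = yield current_word

replacer = replace()
next(replacer)  # prime the generator so that it can accept send()
print(re.sub(r'\s*\w+\s*', lambda g: replacer.send(g.group(0)), s))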

Prints:

ok ok, it is very very very hard
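
One caveat: the generator compares the matched text including its surrounding whitespace, so a repeated word at the very end of the string (no trailing space) is treated as a new word and kept. A rough sketch of the same stateful-callback idea that compares only the captured word is shown below. It uses a closure instead of a generator, and make_replacer is a hypothetical name.

```python
import re

def make_replacer(n=3):
    """Hypothetical helper: build a stateful callback for re.sub."""
    last_word, count = None, 0

    def replacer(match):
        nonlocal last_word, count
        word = match.group(1)  # the word itself, without leading whitespace
        count = count + 1 if word == last_word else 1
        last_word = word
        # Keep the word (with its leading whitespace) for the first n repeats.
        return match.group(0) if count <= n else ""

    return replacer

s = "ok ok, it is very very very very very"
print(re.sub(r"\s*(\w+)", make_replacer(), s))
# ok ok, it is very very very
```

Note that this sketch ignores punctuation between words, so "no, no, no, no" also counts as four consecutive repeats.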

Upvotes: 0

blhsing

Reputation: 106553

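You can capture a word and use a backreference to match it only when it occurs more than three times in a row, then replace the whole run with its first three occurrences:

import re

s = 'ok ok, it is very very very very very hard'
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', s))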

This outputs:

ok ok, it is very very very hard

Upvotes: 1
