Reputation: 37
If I have an input sentence
input = 'ok ok, it is very very very very very hard'
and I want to keep only the first three occurrences of any repeated word:
output = 'ok ok, it is very very very hard'
How can I achieve this with the re or regex module in Python?
Upvotes: 0
Views: 80
Reputation: 163352
One option could be to use a capturing group with a backreference and use that in the replacement.
((\w+)(?: \2){2})(?: \2)*
Explanation
(
Open capture group 1
(\w+)
Capture group 2: match 1+ word characters. (The example data only uses word characters. To make sure the match is not part of a larger word, add a word boundary \b.)
(?: \2){2}
Repeat 2 times: a space followed by a backreference to group 2. Instead of a single space you could use [ \t]+ to match 1+ spaces or tabs, or \s+ to match 1+ whitespace characters (note that \s would also match a newline).
)
Close group 1
(?: \2)*
Match 0+ times a space and a backreference to group 2. These are the extra repetitions that you want to remove.
For example
import re
regex = r"((\w+)(?: \2){2})(?: \2)*"
s = "ok ok, it is very very very very very hard"
result = re.sub(regex, r"\1", s)
if result:
    print(result)
Result
ok ok, it is very very very hard
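The word-boundary caveat from the explanation is easy to hit in practice. Below is a minimal sketch comparing the pattern as posted with an assumed variant that adds \b on both ends and uses \s+ between words. The names PLAIN, BOUNDED and trim are only illustrative.

```python
import re

# Pattern from the answer: single spaces, no word boundaries.
PLAIN = r"((\w+)(?: \2){2})(?: \2)*"
# Assumed variant: \b on both ends and \s+ between the words.
BOUNDED = r"\b((\w+)(?:\s+\2){2})(?:\s+\2)*\b"

def trim(pattern, text):
    """Replace each run matched by `pattern` with its first three words."""
    return re.sub(pattern, r"\1", text)

s = "so so so sober"
print(trim(PLAIN, s))    # so so sober (the " so" inside " sober" is eaten)
print(trim(BOUNDED, s))  # so so so sober (the trailing \b protects "sober")
```

With the bounded variant, the original example sentence still produces the same result as before.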
Upvotes: 1
Reputation: 195438
One solution uses re.sub with a custom replacement function (here backed by a generator):
import re

s = 'ok ok, it is very very very very very hard'

def replace(n=3):
    # Generator that receives each matched word via send() and
    # yields the replacement text for it.
    last_word, cnt = '', 0
    current_word = yield
    while True:
        if last_word == current_word:
            cnt += 1
        else:
            cnt = 0
            last_word = current_word
        if cnt >= n:
            # Already seen n copies in a row: drop this one.
            current_word = yield ''
        else:
            current_word = yield current_word

replacer = replace()
next(replacer)  # prime the generator so it can accept send()
print(re.sub(r'\s*[\w]+\s*', lambda g: replacer.send(g.group(0)), s))
Prints:
ok ok, it is very very very hard
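The generator above compares the whole match, including its surrounding whitespace, so repeats separated by differing amounts of whitespace would not be counted as the same word. As a rough sketch of the same idea with a plain closure, the hypothetical helper limit_repeats below (not from the answer) compares only the captured word:

```python
import re

def limit_repeats(text, n=3):
    """Keep at most n consecutive copies of each word (illustrative helper)."""
    last_word = None
    count = 0

    def callback(match):
        nonlocal last_word, count
        word = match.group(1)  # the word itself, without leading whitespace
        if word == last_word:
            count += 1
        else:
            last_word, count = word, 1
        # Keep the match (whitespace + word) only for the first n copies.
        return match.group(0) if count <= n else ''

    return re.sub(r'\s*(\w+)', callback, text)

print(limit_repeats('ok ok, it is very very very very very hard'))
# ok ok, it is very very very hard
```

One caveat of this sketch: punctuation between words does not reset the count, so 'ok, ok, ok, ok' would lose the fourth 'ok' but keep its comma.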
Upvotes: 0
Reputation: 106553
You can capture a word in a group and use a backreference to match runs in which it occurs more than three times in a row, then replace each run with its first three occurrences:
import re
s = 'ok ok, it is very very very very very hard'  # the question's input string
print(re.sub(r'\b((\w+)(?:\s+\2){2})(?:\s+\2)+\b', r'\1', s))
This outputs:
ok ok, it is very very very hard
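If the number of copies to keep should be configurable, the same pattern can be built from a parameter. This is only a sketch: keep_first_n and its n argument are hypothetical names, not part of the re module.

```python
import re

def keep_first_n(text, n=3):
    """Collapse runs of a repeated word to their first n occurrences."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Group 1 holds the word plus n - 1 repeats (kept);
    # the trailing (?:\s+\2)+ matches the surplus repeats (dropped).
    pattern = rf"\b((\w+)(?:\s+\2){{{n - 1}}})(?:\s+\2)+\b"
    return re.sub(pattern, r"\1", text)

s = 'ok ok, it is very very very very very hard'
print(keep_first_n(s))       # ok ok, it is very very very hard
print(keep_first_n(s, n=2))  # ok ok, it is very very hard
```

The doubled braces in the f-string produce a literal {n - 1} quantifier in the final pattern.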
Upvotes: 1