Bjorn

Reputation: 127

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.

Like

bye! bye! bye!

should become

bye! bye!

My code so far:

import re

def replaceThreeOrMoreCharactersWithTwoCharacters(string):
    # Pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)

Upvotes: 6

Views: 5759

Answers (5)

Bjorn

Reputation: 127

import re

def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any word.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)

Upvotes: 0

Avinash Raj

Reputation: 174696

You could also try the regex below:

(?<= |^)(\S+)(?: \1){2,}(?= |$)

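Sample code, using the third-party regex module (the standard re module rejects this lookbehind because its alternatives differ in width):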

>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"


Upvotes: 3

Tom Fenech

Reputation: 74596

I know you were after a regular expression but you could use a simple loop to achieve the same thing:

def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        # Number of consecutive repeats so far (0 for the first occurrence).
        same = 0 if word != last else same + 1
        if same < max:
            out.append(word)
        last = word
    return ' '.join(out)

As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). Because the words are split on whitespace and rejoined with single spaces, any run of more than one space between words is collapsed to a single space. It's up to you whether you consider that to be a bug or a feature :)
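If losing the original spacing is a problem, the same loop can be run over regex matches instead of `s.split()`. The following is only a sketch of that idea: the function name `max_repeats_keep_spacing` and the parameter name `limit` are made up for illustration and are not part of the answer above. Each match pairs a word with the whitespace in front of it, so kept words carry their original gap along.

```python
import re

def max_repeats_keep_spacing(s, limit=2):
    """Keep at most `limit` consecutive copies of a word, preserving spacing."""
    out = []
    last = None
    count = 0
    # Each match is a word plus the whitespace that precedes it.
    for m in re.finditer(r"(\s*)(\S+)", s):
        gap, word = m.groups()
        # Length of the current run of identical words, this one included.
        count = count + 1 if word == last else 1
        if count <= limit:
            out.append(gap + word)
        last = word
    # Whitespace after the final word is not matched, so it is dropped.
    return "".join(out)

print(max_repeats_keep_spacing("bye!  bye!   bye! now"))
```

A dropped word takes its leading whitespace with it, so the gap before the next kept word is the one that originally preceded that word.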

Upvotes: 2

Casimir et Hippolyte

Reputation: 89547

Assuming that a "word" in your requirements means one or more non-whitespace characters surrounded by whitespace or by the string boundaries, you can try this pattern:

re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
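To make the roles of the two groups easier to see, here is the same pattern written out with `re.VERBOSE` and wrapped in a small function. This is just an annotated sketch: the names `REPEATED_WORDS` and `collapse_repeated_words` are invented for the example.

```python
import re

# The pattern from above, spread over several lines so each part can be commented.
REPEATED_WORDS = re.compile(r"""
    (?<!\S)          # not preceded by a non-whitespace character (word start)
    (                # group 1: the two copies that are kept
        (\S+)        # group 2: the word itself
        (?:\s+\2)    # the same word again, after some whitespace
    )
    (?:\s+\2)+       # one or more further copies, which are dropped
    (?!\S)           # not followed by a non-whitespace character (word end)
""", re.VERBOSE)

def collapse_repeated_words(text):
    # Replace the whole run with group 1, i.e. the first two copies.
    return REPEATED_WORDS.sub(r"\1", text)

print(collapse_repeated_words("bye! bye! bye!"))
print(collapse_repeated_words("aa aa aaa"))
```

The two lookarounds are what stop a partial match: in "aa aa aaa" the third token only starts with "aa", so `(?!\S)` fails and the string is left unchanged.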

Upvotes: 5

hjpotter92

Reputation: 80629

Try the following:

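import re
s = "bye! bye! bye!"  # your string
# Note: the optional space in (?:\1 ?) also consumes the space after the last
# repeat, so a word following the run ends up joined to it.
s = re.sub(r'(\S+) (?:\1 ?){2,}', r'\1 \1', s)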

You can see a sample code here: http://codepad.org/YyS9JCLO

Upvotes: 0
