user1810023

Reputation: 19

Python 3.x regular expression syntax

I am trying to remove all repeated words from a string, as in the example below. What is the correct syntax to match one or more repetitions of a word? The example below returns

cat cat in the hat hat hat

It only handles a single repetition per match: "in" and "the", which are each repeated once, are collapsed correctly, but longer runs such as "cat cat cat" are only partly reduced.

>>> re.sub(r'(\b[a-z]+) \1', r'\1', 'cat cat cat in in the the hat hat hat hat hat hat')

Upvotes: 0

Views: 161

Answers (5)

dannymilsom

Reputation: 2406

A non-regex alternative, when word order isn't important, would be

" ".join(set(string_with_duplicates.split()))

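This first splits the string on whitespace, turns the resulting list into a set (which removes duplicates, since every element of a set is unique), and then joins the items back into a string. The order of the words in the result is arbitrary, and every repeated word is dropped, not only adjacent repeats.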

>>> string_with_duplicates = 'cat cat cat in in the the hat hat hat hat hat hat'
>>> " ".join(set(string_with_duplicates.split()))
'the in hat cat'

If the order of the words needs to be preserved, you could write something like this:

>>> unique = []
>>> for w in string_with_duplicates.split():
...     if w not in unique:
...         unique.append(w)
...
>>> " ".join(unique)
'cat in the hat'
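
A more compact way to get the same order-preserving result, as a rough sketch: it assumes Python 3.7 or later, where a dict keeps its keys in insertion order. Like the loop, it drops every repeated word, not only adjacent repeats.

```python
# Sketch: order-preserving de-duplication via dict keys.
# Assumes Python 3.7+, where dict preserves insertion order.
string_with_duplicates = 'cat cat cat in in the the hat hat hat hat hat hat'

# dict.fromkeys keeps the first occurrence of each word as a key.
unique_words = dict.fromkeys(string_with_duplicates.split())
result = " ".join(unique_words)
print(result)  # cat in the hat
```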

Upvotes: 0

Lily Mara

Reputation: 4138

This should print the given sentence without consecutive duplicates:

check_for_repeats = 'cat cat cat in in the the hat hat hat hat hat hat'
words = check_for_repeats.split()
sentence_array = []

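for index, word in enumerate(words[:-1]):
    # Keep a word only when the next word differs, i.e. it ends its run.
    if word != words[index + 1]:
        sentence_array.append(word)
# The final word always ends its run, so keep it whenever there is one.
if words:
    sentence_array.append(words[-1])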

sentence = ' '.join(sentence_array)
print(sentence)
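
The same "collapse each run of equal neighbours" idea can probably be written more briefly with itertools.groupby from the standard library. This is only a sketch of an alternative, not a replacement for the loop above; the names collapsed and _run are just illustrative.

```python
from itertools import groupby

check_for_repeats = 'cat cat cat in in the the hat hat hat hat hat hat'

# groupby yields one (key, group) pair per run of equal adjacent items,
# so keeping only the keys collapses each run to a single word.
collapsed = ' '.join(word for word, _run in groupby(check_for_repeats.split()))
print(collapsed)  # cat in the hat
```

Unlike the set-based approach, this only merges adjacent repeats, so a word that reappears later in the sentence is kept.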

Upvotes: 1

Casimir et Hippolyte

Reputation: 89574

You can use this:

re.sub(r'(\b[a-z]+) (?=\1\b)', '', 'cat cat cat in in the the hat hat hat hat hat hat')
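
To see why this works, here is the same pattern as a small runnable sketch with comments. The variable names are only for illustration.

```python
import re

text = 'cat cat cat in in the the hat hat hat hat hat hat'

# (\b[a-z]+) captures a word, and the space after it is consumed.
# (?=\1\b) only peeks ahead: the same whole word must follow, but it is
# not consumed, so it can be the start of the next match.
pattern = re.compile(r'(\b[a-z]+) (?=\1\b)')

# Each match is "word + space" and is replaced with nothing, so only the
# last copy in every run survives.
result = pattern.sub('', text)
print(result)  # cat in the hat
```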

Upvotes: 0

Barmar

Reputation: 781721

Try this:

re.sub(r'(\b[a-z]+)(?: \1)+', r'\1', 'cat cat cat in in the the hat hat hat hat hat hat')

The repetition operator after the non-capturing group that wraps the back-reference lets a single match cover any number of repetitions.
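
A side-by-side sketch of the two patterns on the question's input, to show what the repeated group changes (variable names are just for illustration):

```python
import re

text = 'cat cat cat in in the the hat hat hat hat hat hat'

# Pattern from the question: each match consumes exactly two copies of a word,
# so a run of three leaves one extra copy behind.
pairs_only = re.sub(r'(\b[a-z]+) \1', r'\1', text)

# With (?: \1)+ a single match consumes the whole run of copies.
whole_runs = re.sub(r'(\b[a-z]+)(?: \1)+', r'\1', text)

print(pairs_only)  # cat cat in the hat hat hat
print(whole_runs)  # cat in the hat
```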

Upvotes: 0

Sam

Reputation: 20486

Try this regex:

(\b[a-z]+)(?: \1)+

I put your \1 into a non-capturing group so that it can be repeated one or more times. You can then do the replacement the same way you did:

re.sub(r'(\b[a-z]+)(?: \1)+', r'\1', 'cat cat cat in in the the hat hat hat hat hat hat')
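
One caveat worth checking, as a hedged sketch: without a \b after the back-reference, the pattern can also match when the next word merely starts with the captured word. The input below is a made-up example, not from the question, and the \b variant is a suggested tweak rather than part of the regex above.

```python
import re

# Hypothetical input: "the" is a prefix of "then".
text = 'the then the the end'

# No trailing boundary: "the the" is matched inside "the then".
without_boundary = re.sub(r'(\b[a-z]+)(?: \1)+', r'\1', text)

# Trailing \b: the back-reference must be followed by a word boundary,
# so only whole-word repeats are collapsed.
with_boundary = re.sub(r'\b([a-z]+)(?: \1\b)+', r'\1', text)

print(without_boundary)  # then the end
print(with_boundary)     # the then the end
```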

Upvotes: 0
