Bold O

Reputation: 11

Replace sequence of same characters

What is the fastest way in Python to replace a sequence of 3 or more identical characters in UTF-8 text? I need to replace every such sequence with exactly 2 of that character, i.e.:

aaa -> aa 
bbbb -> bb
abbbcd -> abbcd
124xyyyz3 -> 124xyyz3

Upvotes: 1

Views: 2247

Answers (3)

Jon Clements

Reputation: 142206

Although for this specific case I would go for a regular expression, you could also make this generic so that it operates on arbitrary sequences, e.g.:

from itertools import groupby, chain, islice

s = 'abaaaaaabbbbbbbbcdcddddde'
print(''.join(chain.from_iterable(islice(g, 2) for k, g in groupby(s))))
# abaabbcdcdde
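
To illustrate the "arbitrary sequences" point, here is a minimal sketch that wraps the same groupby/islice idea in a helper. The name cap_runs and the limit parameter are hypothetical, added only for illustration, and the snippet assumes Python 3:

```python
from itertools import chain, groupby, islice

def cap_runs(iterable, limit=2):
    """Return an iterator keeping at most `limit` items from each run of equal items."""
    # groupby yields one group per run of consecutive equal items;
    # islice takes only the first `limit` items of each group.
    return chain.from_iterable(islice(group, limit) for _, group in groupby(iterable))

print(list(cap_runs([1, 1, 1, 2, 3, 3, 3, 3])))  # [1, 1, 2, 3, 3]
print(''.join(cap_runs('124xyyyz3')))            # 124xyyz3
```

Because it only compares neighbouring items, this should work for lists, tuples, or any other iterable, not just strings.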

Upvotes: 1

Goran Rakic

Reputation: 1799

You can use a regular expression:

import re
re.sub(r'(.)\1{2,}', r'\1\1', 'bbbbbaaacc')

The pattern captures any character and then requires the same character to repeat two or more times, so it matches a run of three or more. The replacement substitutes the whole match with just two copies of the captured character. The dot does not match newlines, so runs of newlines are left alone; use (.|\n) or the re.DOTALL flag to handle those.

It works with Unicode too:

re.sub(r'(.)\1{2,}', r'\1\1', u'жжж')

And if you have a byte string x containing UTF-8 text, decode it first with x.decode('utf-8') and run the substitution on the result.
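
Putting those pieces together, a rough sketch might look like this. It assumes Python 3, where the input arrives as bytes and is decoded first; the helper name squeeze_runs is made up for the example:

```python
import re

# (.) captures one character; \1{2,} requires at least two more copies of it.
# re.DOTALL lets the dot match newlines too, so runs of blank lines are squeezed.
_RUN = re.compile(r'(.)\1{2,}', re.DOTALL)

def squeeze_runs(text):
    """Collapse every run of 3+ identical characters down to exactly 2."""
    return _RUN.sub(r'\1\1', text)

raw = 'жжжж abbbcd\n\n\n\nend'.encode('utf-8')  # UTF-8 encoded bytes
print(repr(squeeze_runs(raw.decode('utf-8'))))    # 'жж abbcd\n\nend'
```

Compiling the pattern once is only a convenience when the substitution is applied to many strings.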

Upvotes: 7

jamylak

Reputation: 133634

>>> import re
>>> re.sub(r'(\w)\1{2,}', r'\1\1', 'aaa')
'aa'
>>> re.sub(r'(\w)\1{2,}', r'\1\1', 'bbbb')
'bb'
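
One thing to keep in mind: \w only matches word characters (letters, digits and underscore), so runs of punctuation or whitespace are left untouched. A quick comparison against the dot-based pattern, assuming Python 3 (the variable names are just for illustration):

```python
import re

text = 'yesss!!!'
word_only = re.sub(r'(\w)\1{2,}', r'\1\1', text)  # only word characters are squeezed
any_char = re.sub(r'(.)\1{2,}', r'\1\1', text)    # any character except newline

print(word_only)  # yess!!!
print(any_char)   # yess!!
```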

Upvotes: 12
