Reputation: 11
What is the fastest way in Python to replace a sequence of 3 or more identical characters in UTF-8 text?
I need to replace every run of 3 or more identical characters with exactly 2 of that character, e.g.:
aaa -> aa
bbbb -> bb
abbbcd -> abbcd
124xyyyz3 -> 124xyyz3
Upvotes: 1
Views: 2247
Reputation: 142206
Although for this specific case I would go for a regular expression, you could also make this generic so that it operates on arbitrary sequences, e.g.:
from itertools import groupby, chain, islice
s = 'abaaaaaabbbbbbbbcdcddddde'
print(''.join(chain.from_iterable(islice(g, 2) for k, g in groupby(s))))
# abaabbcdcdde
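To illustrate the "arbitrary sequences" point, here is a minimal sketch that wraps the same groupby/islice idea in a function. The name cap_runs and the limit parameter are made up for this example and are not part of the answer above.

```python
from itertools import chain, groupby, islice

def cap_runs(items, limit=2):
    """Return an iterator keeping at most `limit` consecutive equal elements."""
    # groupby yields one group per run of equal items; islice trims each run.
    return chain.from_iterable(islice(group, limit) for _, group in groupby(items))

print(list(cap_runs([1, 1, 1, 2, 3, 3, 3, 3])))  # [1, 1, 2, 3, 3]
print(''.join(cap_runs('124xyyyz3')))            # 124xyyz3
```

Because the result is lazy, it should be consumed once (for example with list or ''.join) rather than iterated repeatedly.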
Upvotes: 1
Reputation: 1799
You can use a regular expression:
import re
re.sub(r'(.)\1{2,}', r'\1\1', 'bbbbbaaacc')
The pattern captures any character followed by the same character repeated two or more times, so it matches a run of three or more. The replacement substitutes just two copies of the captured character for the matched run. The dot does not match newlines, so repeated newlines are not collapsed; use (.|\n) or the re.DOTALL flag for that.
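A small sketch of the newline caveat, assuming Python 2.7+ or 3.x where re.sub accepts a flags keyword. The variable names are only for illustration.

```python
import re

text = 'a\n\n\n\nb'

# Without DOTALL, '.' never matches '\n', so the run of newlines is left alone.
plain = re.sub(r'(.)\1{2,}', r'\1\1', text)

# With DOTALL, '.' also matches '\n', so the run is cut down to two.
dotall = re.sub(r'(.)\1{2,}', r'\1\1', text, flags=re.DOTALL)

print(repr(plain))   # 'a\n\n\n\nb'
print(repr(dotall))  # 'a\n\nb'
```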
It works with Unicode too:
re.sub(r'(.)\1{2,}', r'\1\1', u'жжж')
And if you have a byte string x containing UTF-8 text, decode it first with x.decode('utf-8').
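Here is a rough sketch of why decoding first matters, written for Python 3 (the answer above predates it). The raw value is a stand-in for bytes read from a file or socket.

```python
import re

raw = 'жжжж'.encode('utf-8')  # stand-in for UTF-8 bytes from a file or socket

# On the raw bytes the pattern sees alternating bytes (d0 b6 d0 b6 ...),
# so no single byte repeats and nothing is replaced.
on_bytes = re.sub(rb'(.)\1{2,}', rb'\1\1', raw)

# After decoding, each character is one unit and the run is collapsed.
on_text = re.sub(r'(.)\1{2,}', r'\1\1', raw.decode('utf-8'))

print(on_bytes == raw)  # True
print(on_text)          # жж
```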
Upvotes: 7
Reputation: 133634
>>> import re
>>> re.sub(r'(\w)\1{2,}', r'\1\1', 'aaa')
'aa'
>>> re.sub(r'(\w)\1{2,}', r'\1\1', 'bbbb')
'bb'
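One difference from the (.) pattern is worth noting: \w matches only word characters (letters, digits, underscore), so runs of punctuation or whitespace are left untouched. A short comparison, with a made-up sample string:

```python
import re

s = 'sooo good!!!!'

word_only = re.sub(r'(\w)\1{2,}', r'\1\1', s)  # collapses only word characters
any_char = re.sub(r'(.)\1{2,}', r'\1\1', s)    # collapses any character except newline

print(word_only)  # soo good!!!!
print(any_char)   # soo good!!
```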
Upvotes: 12