Reputation: 359
I am going through almost 120 Billion string combinations. I am trying to find the most speed optimized way of determining if the string in question has 3 (or more) sequential duplicate characters.
Ex:
string = "blah"
The test should return false.
string = "blaaah"
This would return true.
I successfully implemented a basic for loop that looped through each string's characters and compared the next character for a match. This worked, but for the quantity of strings I am filtering through I really would like to optimize it.
Any suggestions? Thanks!
Upvotes: 2
Views: 5134
Reputation: 1121406
You could instead use itertools.groupby()
here. You'll still have to scan through the string, but so would a regex:
from itertools import groupby
three_or_more = (char for char, group in groupby(input_string)
if sum(1 for _ in group) >= 3)
This produces a generator; iterate over it to list all characters that are found 3 or more times, or use any()
to see if there is at least one such group:
if any(three_or_more):
# found at least one group of consecutive characters that
# consists of 3 or more.
Unfortunately for me, the re
solution is more efficient here:
>>> from timeit import timeit
>>> import random
>>> from itertools import groupby
>>> import re
>>> import string
>>> def consecutive_groupby(string):
... three_or_more = (char for char, group in groupby(string)
... if sum(1 for _ in group) >= 3)
... return any(three_or_more)
...
>>> def consecutive_re(string):
... return re.search(r'(.)\1\1', string) is not None
...
>>> # worst-case: random data with no consecutive strings
...
>>> test_string = ''.join([random.choice(string.ascii_letters) for _ in range(1000)])
>>> consecutive_re(test_string), consecutive_groupby(test_string)
(False, False)
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_re as consecutive', number=10000)
0.19730806350708008
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_groupby as consecutive', number=10000)
4.633949041366577
>>> # insert repeated characters
...
>>> test_string_with_repeat = test_string[:100] + 'aaa' + test_string[100:]
>>> consecutive_re(test_string_with_repeat), consecutive_groupby(test_string_with_repeat)
(True, True)
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_re as consecutive', number=10000)
0.03344106674194336
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_groupby as consecutive', number=10000)
0.4827418327331543
The regular expression approach given by Avinash is the clear winner here, which goes to show you should always measure alternatives.
Upvotes: 4
Reputation: 47159
You could define a capture group pattern, then search for it repeated:
import re
s = 'blaaah'
p = '(?P<g>.)(?P=g){2}'
m = re.search(p, s, re.M)
print(m).group(0)
Result:
aaa
Upvotes: 1
Reputation: 174696
Through re
module.
>>> def consecutive(string):
if re.search(r'(.)\1\1', string):
print('True')
else:
print('False')
>>> consecutive('blah')
False
>>> consecutive('blaah')
False
>>> consecutive('blaaah')
True
>>> consecutive('blaaaah')
True
()
called capturing group which is used to capture characters which are matched by the pattern present inside that group. \1
back-references the characters present inside the capturing group.In the string blaaah
, (.)
captures the first a
and checks for the immediate two occurrences of a
. So aaa
got matched.
Upvotes: 5