Reputation: 1131
If I have a pattern 'haha' and a text 'hahahaha', I want the indices of the pattern to be returned in a list like [0,2,4]. However, if I do it this way
def find_pat(text, pattern)
import re
return[x.start() for x in re.finditer(pattern,text)]
it returns [0,4], and fails to see the repetition of the pattern at index 2. How would I achieve the needed output in the most pythonic and efficient way?
Upvotes: 0
Views: 144
Reputation: 366133
As Joran Beasley's answer shows, using plain string functions will be much faster than using regexps for static string lookups.
But testing startswith
N times could itself be a huge slowdown if N is very large and matches are uncommon. Also, since you were using finditer
rather than findall
, I suspect you may be worried about that case.
You can get the best of both worlds by using str.find
. Of course ultimately this is doing the same work as using startswith
at each point—but it's doing that work in C, zipping over long stretches with no matches as much as 20x faster.
On the other hand, there's no way to wrap this repeated find
up in a trivial looping expression. (Unless you build a complicated wrapper using iter
around a closure, but I doubt that would actually help anything.) So, the code will look more complicated than Joran's listcomp. And it could be slower when matches are very common (because in that case you're spending most of your time in the loop, and an explicit loop statement is slower than a comprehension).
On the plus side, the extra verbosity means it's more obvious how to customize it. For example, if you decide you want to skip overlapped matches, just do i += len(pattern)
instead of i += 1
.
def finditer(text, pattern):
i = 0
while True:
i = text.find(pattern, i)
if i == -1: return
yield i
i += 1
From a quick test (under 64-bit Apple CPython 2.7.5):
In [931]: pattern = 'ha'
In [932]: text = 'hahahaha'
In [933]: %timeit [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
100000 loops, best of 3: 2.69 µs per loop
In [934]: %timeit list(finditer(text, pattern))
100000 loops, best of 3: 3.56 µs per loop
In [935]: text = ('hahahaha' + string.ascii_lowercase + 'ha')*100
pattern = 'ha'
In [936]: %timeit [i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
100000 loops, best of 3: 1.74 ms per loop
In [937]: %timeit list(finditer(text, pattern))
100000 loops, best of 3: 180 µs per loop
So, it's almost as fast as Joran's code even for a very short string with 50% matches; for a much longer string with 11% matches, it's already 9.6x faster. If matches are even less common, or if we don't actually need a list, obviously it will win even bigger.
Upvotes: 1
Reputation: 54243
I'm sure there's a better way to do this. Here's my first take on it though:
EDIT: I misread the question -- he was trying to match [haha]haha, ha[haha]ha, and haha[haha]. Mine only matches [haha]haha and haha[haha] because I thought he was looking for unique matches. Reading comprehension is a plus, when programming.
def find_text(text,pattern):
indexes = list()
index = 0
patlen = len(pattern)
while index<=len(text)-patlen:
if text[index:].startswith(pattern):
indexes.append(index)
index+=patlen
else:
index+=1
return indexes
Upvotes: 0
Reputation: 114098
pattern = "haha"
text = "hahahaha"
[i for i in range(len(text)-len(pattern)+1) if text[i:].startswith(pattern)]
iter will consume the tokens it has seen so it consumes the first four letters with the first match ...
actually the solution using the look ahead is probably the right one(based on the question rather than the usecase) but this would also work ...
Upvotes: 3
Reputation:
Use positive lookahead assertion:
>>> import re
>>> def find_pat(text, pattern):
... return [x.start() for x in re.finditer("(?={})".format(pattern), text)]
...
>>> find_pat('hahahaha', 'haha')
[0, 2, 4]
>>>
Here is a reference.
Upvotes: 3