Reputation: 165
My input is two strings of the same length and a number which represents the length of the common words I need to find in both strings. I wrote a very straightforward code to do so, and it works, but it is super super slow, considering the fact that each string is ~200K letters.
This is my code:
for i in range(len(X)):
for j in range(len(Y)):
if(X[i] == Y[j]):
for k in range (kmer):
if (X[i+k] == Y[j+k]):
count +=1
else:
count=0
if(count == int(kmer)):
loc=(i,j)
pos.append(loc)
count=0
if(Xcmp[i] == Y[j]):
for k in range (kmer):
if (Xcmp[i+k] == Y[j+k]):
count +=1
else:
count=0
if(count == int(kmer)):
loc=(i,j)
pos.append(loc)
count=0
return pos
Where the first sequence is X and the second is Y and kmer is the length of the common words. (and when I say word, I just mean characters..)
I was able to create a X by kmer matrix (rather than the huge X by Y) but that's still very slow.
I also thought about using a trie, but thought that maybe it will take too long to populate it?
At the end I only need the positions of those common subsequences.
any ideas on how to improve my algorithm? Thanks!! :)
Upvotes: 3
Views: 218
Reputation: 304503
Create a set of words like this
words = {X[i:i+kmer] for i in range(len(X)-kmer+1)}
for i in range(len(Y)-kmer+1):
if Y[i:i+kmer] in words:
print Y[i:i+kmer]
This is fairly efficient as long as kmer
isn't so large that you'd run out of memory for the set. I assume it isn't since you were creating a matrix that size already.
For the positions, create a dict instead of a set as Tim suggests
from collections import defaultdict
wordmap = defaultdict(list)
for i in range(len(X)-kmer+1):
wordmap[X[i:i+kmer]].append(i)
for i in range(len(Y)-kmer+1):
word = Y[i:i+kmer]
if word in wordmap:
print word, wordmap[word], i
Upvotes: 2
Reputation: 7209
A triple nested for loop is giving you a runtime of n^3 because you're literally going through each entry. Consider using Rolling Hash. It has a linear average runtime and worstcase n^2. It's best for finding substrings so more or less what you're doing. In this case you may be closer to n^2 but it's still pretty good over n^3.
Upvotes: 0