FairyDuster
FairyDuster

Reputation: 165

An efficient way to search similar words (with specified length) in two strings using python

My input is two strings of the same length and a number which represents the length of the common words I need to find in both strings. I wrote a very straightforward code to do so, and it works, but it is super super slow, considering the fact that each string is ~200K letters.

This is my code:

for i in range(len(X)):
    for j in range(len(Y)):
        if(X[i] == Y[j]):
            for k in range (kmer):                
                if (X[i+k] == Y[j+k]):
                    count +=1
                else:
                    count=0
                if(count == int(kmer)):
                    loc=(i,j)
                    pos.append(loc)
                    count=0    

        if(Xcmp[i] == Y[j]):
            for k in range (kmer):                
                if (Xcmp[i+k] == Y[j+k]):
                    count +=1
                else:
                    count=0
                if(count == int(kmer)):
                    loc=(i,j)
                    pos.append(loc)
                    count=0

return pos 

Where the first sequence is X and the second is Y and kmer is the length of the common words. (and when I say word, I just mean characters..)

I was able to create a X by kmer matrix (rather than the huge X by Y) but that's still very slow.

I also thought about using a trie, but thought that maybe it will take too long to populate it?

At the end I only need the positions of those common subsequences.

any ideas on how to improve my algorithm? Thanks!! :)

Upvotes: 3

Views: 218

Answers (2)

John La Rooy
John La Rooy

Reputation: 304503

Create a set of words like this

words = {X[i:i+kmer] for i in range(len(X)-kmer+1)}
for i in range(len(Y)-kmer+1):
    if Y[i:i+kmer] in words:
        print Y[i:i+kmer]

This is fairly efficient as long as kmer isn't so large that you'd run out of memory for the set. I assume it isn't since you were creating a matrix that size already.

For the positions, create a dict instead of a set as Tim suggests

from collections import defaultdict
wordmap = defaultdict(list)
for i in range(len(X)-kmer+1):
    wordmap[X[i:i+kmer]].append(i)

for i in range(len(Y)-kmer+1):
    word = Y[i:i+kmer]
    if word in wordmap:
        print word, wordmap[word], i

Upvotes: 2

ZekeDroid
ZekeDroid

Reputation: 7209

A triple nested for loop is giving you a runtime of n^3 because you're literally going through each entry. Consider using Rolling Hash. It has a linear average runtime and worstcase n^2. It's best for finding substrings so more or less what you're doing. In this case you may be closer to n^2 but it's still pretty good over n^3.

Upvotes: 0

Related Questions