Reputation: 2491
I am trying to find all forms of insertions between 2 strings. So I have a list of 14 million strings and then I have to check for each string what possible insertions can transform one string to another (basically counting insertion frequencies). Say x is one string and y is another string where x is a sub-string of y, so we have to find out what insertions convert x to y.
I am using the following code segment. It works, but is taking way to much time. I even tried to distribute the load across 64 processors still it will take 20 days to complete.
for i in Words:
#trying to distribute load across different processes, so can ignore this part
h = hashlib.sha256(i)
n = int(h.hexdigest(),base=16)
if (n%64!=ix): #ix is a process based id
continue
for j in Words:#
if len(i)>len(j):
continue
if( i!=j and i in j): # i is a substring of j
ind=j.find(i)
s1=j[0:ind]
s2=j[ind+len(i):len(j)]
if(len(s1)>0):
if (not transform.has_key(s1)):
transform[s1]=1
else:
transform[s1]+=1
if(len(s2)>0):
if (not transform.has_key(s2)):
transform[s2]=1
else:
transform[s2]+=1
Upvotes: 1
Views: 103
Reputation: 4723
Instead of comparing each word to each other (quadratic runtime), take each proper substring of each word (linear runtime, assuming word length is bounded) and check if it is in the set of words (looking up elements of a set
is constant time).
This ran in less than 2 seconds on my laptop (for 46265 words (of length < 10) with 47015 unique transforms (799089 total)):
from collections import Counter
# for testing
from random import choice, randrange
from string import ascii_uppercase
big_word = "".join(choice(ascii_uppercase) for i in range(10000))
words = [big_word[randrange(len(big_word)):][:randrange(1, 10)] for i in range(100000)] # words of up to 9 letters; all are substrings of big_word
# now the real code
def insertions(words):
for word in words:
for i in range(1, len(word) - 1):
ins = word[:i]
rest = word[i:]
for j in range(1, len(rest)):
if rest[:j] in words:
yield ins
for i in range(1, len(word) - 1):
rest = word[:i]
ins = word[i:]
for j in range(len(rest) - 1):
if rest[j:] in words:
yield ins
transforms = Counter(insertions(set(words)))
Upvotes: 1