Reputation: 959
I have a file of short strings, which I have loaded into a list called short (there are 1.5 million short strings, each of length 150). I want to count how many of these short strings are present in a longer string (of length ~5 million), called seq in the code below. I use the following obvious implementation, but it takes a very long time (around a day) to run.
count1 = count2 = 0
for line in short:
    count1 += 1
    if line in seq:
        count2 += 1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
Is there a way I can do this more efficiently?
Upvotes: 6
Views: 3679
Reputation: 35761
Do profiling, and try different options. You will not get around iterating through your sequence of test strings, so the loop for line in short is something you will most probably keep. The test if line in seq is implemented quite efficiently in CPython, but it is not optimized for searching a small needle in a very large haystack. Your requirements are a bit extreme, and I suspect it is exactly this test that takes quite a while and is the bottleneck of your code. You might want to try, just as a comparison, the regex module for searching the needle in the haystack.
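Just to make the comparison concrete, here is a minimal sketch (my addition, not benchmarked below; it uses the standard-library re module, whose search interface the third-party regex module mirrors, and the helper name is hypothetical):
import re

def found_with_regex(needle, haystack):
    # re.escape makes the needle match literally rather than being
    # interpreted as a regex pattern.
    return re.search(re.escape(needle), haystack) is not None

# e.g. count2 = sum(1 for line in short if found_with_regex(line, seq))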
Edit:
A rudimentary benchmark (no repetitions, no scaling behavior investigated, no profile module used), for comparing the methods discussed in this thread:
import string
import random
import time

def genstring(N):
    return ''.join(random.choice(string.ascii_uppercase) for _ in xrange(N))

t0 = time.time()
length_longstring = 10**6
length_shortstring = 7
nbr_shortstrings = 3*10**6
shortstrings = [genstring(length_shortstring) for _ in xrange(nbr_shortstrings)]
longstring = genstring(length_longstring)
duration = time.time() - t0
print "Setup duration: %.1f s" % duration
def method_1():
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in longstring:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'

#t0 = time.time()
#method_1()
#duration = time.time() - t0
#print "M1 duration: %.1f s" % duration
def method_2():
    shortset = set()
    for i in xrange(len(longstring)-length_shortstring+1):
        shortset.add(longstring[i:i+length_shortstring])
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in shortset:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'

t0 = time.time()
method_2()
duration = time.time() - t0
print "M2 duration: %.1f s" % duration
def method_3():
    shortset = set(
        longstring[i:i+length_shortstring] for i in xrange(
            len(longstring)-length_shortstring+1))
    count1 = len(shortstrings)
    count2 = sum(1 for ss in shortstrings if ss in shortset)
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'

t0 = time.time()
method_3()
duration = time.time() - t0
print "M3 duration: %.1f s" % duration
Output:
$ python test.py
Setup duration: 23.3 s
364 of 3000000 strings are in long string.
M2 duration: 1.4 s
364 of 3000000 strings are in long string.
M3 duration: 1.2 s
(This is Python 2.7.3 on Linux, on an E5-2650 0 @ 2.00GHz.)
There is only a slight difference between the method proposed by nneonneo (method_2) and the improvement suggested by chepner (method_3). Under these conditions, executing the original code is already no fun. Under somewhat less extreme conditions we can compare all three methods:
length_longstring = 10**6
length_shortstring = 5
nbr_shortstrings = 10**5
->
$ python test1.py
Setup duration: 1.4 s
8121 of 100000 strings are in long string.
M1 duration: 95.0 s
8121 of 100000 strings are in long string.
M2 duration: 0.4 s
8121 of 100000 strings are in long string.
M3 duration: 0.4 s
Upvotes: 2
Reputation: 5514
OK, I know you've already accepted another answer which works well, but just for the sake of completeness, here's a filled-out version of what RedX suggested in the comments (I think):
import itertools

PREFIXLEN = 50  # This will need to be adjusted for efficiency; consider doing a sensitivity study
commonpres = itertools.groupby(sorted(short), lambda x: x[0:PREFIXLEN])

survivors = []
precount = 0
for prefix, grouped in commonpres:
    precount += 1
    if prefix in seq:
        survivors.extend(grouped)

actcount = 0
for survivor in survivors:
    if survivor in seq:
        actcount += 1

print "{} of {} strings are in long string.".format(actcount, len(short))
print "{} short strings ruled out by groups".format(len(short) - len(survivors))
print "{} total comparisons done".format(len(survivors) + precount)
The idea here is to rule out as many common prefixes as possible before checking all the surviving strings individually. In an extreme example, suppose your 1.5 million short strings fit into 10 common prefixes. For simplicity, let's also suppose that they are evenly divided, 150,000 per prefix. If we can eliminate two of those prefixes with 10 checks, then we save 300,000 checks later. This is why PREFIXLEN needs to be tuned. If it's too low, the prefixes will be too common to rule anything out (with a prefix of length one, you'd still end up doing 1.5 million checks), whereas a PREFIXLEN which is too high will give you no gains from eliminating prefixes, since few eliminations will occur. I arbitrarily picked 50; that may or may not help you.
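A rough sketch of such a sensitivity study might look like this (my addition; it assumes short and seq are loaded as in the question, and just times the filtering pass for a few candidate prefix lengths):
import itertools
import time

def prefix_filter(short, seq, prefixlen):
    # One filtering pass: group the short strings by prefix and keep
    # only the groups whose prefix actually occurs in seq.
    survivors = []
    for prefix, grouped in itertools.groupby(sorted(short), lambda x: x[0:prefixlen]):
        if prefix in seq:
            survivors.extend(grouped)
    return survivors

for plen in (10, 25, 50, 75, 100):
    t0 = time.time()
    nsurv = len(prefix_filter(short, seq, plen))
    print "PREFIXLEN=%d: %.1f s, %d survivors" % (plen, time.time() - t0, nsurv)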
As I said before, this answer is pretty much academic, so if anyone sees anything to be improved, please comment or just edit.
Upvotes: 1
Reputation: 179422
If the short strings are a constant length (you indicated they are 150 characters long), you can preprocess the long string to extract all substrings of that length, then just do set lookups (which take constant time in expectation):
shortlen = 150
shortset = set()
for i in xrange(len(seq)-shortlen+1):
    shortset.add(seq[i:i+shortlen])

count1 = count2 = 0  # initialize the counters (missing in the original snippet)
for line in short:
    count1 += 1
    if line in shortset:
        count2 += 1
The running time for this is probably going to be dominated by the preprocessing step (because it inserts nearly 5M strings of length 150 each), but that should still be faster than 1.5M searches in a 5M character string.
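One back-of-the-envelope caveat (my estimate, not something this answer claims; figures are for a 64-bit CPython 2.7 build):
import sys

# A 150-char str weighs sys.getsizeof('x' * 150) == 187 bytes here,
# before set and hash-table overhead, so the substring set is large.
n_substrings = 5 * 10**6 - 150 + 1
per_string = sys.getsizeof('x' * 150)
print "~%.1f GB for the string objects alone" % (n_substrings * per_string / 1e9)
So for a 5M-character seq, memory may become the limiting factor before CPU time does.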
Upvotes: 5