Reputation: 71
I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:
genome = FASTAdata[1]
genomeLength = len(genome);
# Hash table holding all the k-mers we will come across
kmers = dict()
# We go through all the possible k-mers by index
for outer in range (0, genomeLength-1):
for inner in range (outer+2, outer+22):
substring = genome[outer:inner]
if substring in kmers: # if we already have this substring on record, increase its value (count of num of appearances) by 1
kmers[substring] += 1
else:
kmers[substring] = 1 # otherwise record that it's here once
This is to search through all substrings of length at most 20. Now this code seems to take pretty forever and never terminate, so something has to be wrong here. Is using [:] on strings causing the huge overhead? And if so, what can I replace it with?
And for clarity the file in question is nearly 200mb, so pretty big.
Upvotes: 7
Views: 217
Reputation: 46872
You should use memoryview to avoid creating sub-strings as [:]
will then return a "view" instead of a copy, BUT you must use Python 3.3 or higher (before that, they are not hashable).
Also, a Counter will simplify your code.
from collections import Counter
genome = memoryview("abcdefghijkrhirejtvejtijvioecjtiovjitrejabababcd".encode('ascii'))
genomeLength = len(genome)
minlen, maxlen = 2, 22
def fragments():
for start in range (0, genomeLength-minlen):
for finish in range (start+minlen, start+maxlen):
if finish <= genomeLength:
yield genome[start:finish]
count = Counter(fragments())
for (mv, n) in count.most_common(3):
print(n, mv.tobytes())
produces:
4 b'ab'
3 b'jt'
3 b'ej'
A 1,000,000 byte random array takes 45s on my laptop, but 2,000,000 causes swapping (over 8GB memory use). However, since your fragment size is small, you can easily break the problem up into million-long sub-sequences and then combine results at the end (just be careful about overlaps). That would give a total running time for a 200MB array of ~3 hours, with luck.
PS To be clear, by "combine results at the end" I assume that you only need to save the most popular for each 1M sub-sequence, by, for example, writing them to a file. You cannot keep the counter in memory - that is what is using the 8GB. That's fine if you have fragments that occur many thousands of times, but obviously won't work for smaller numbers (you might see a fragment just once in each of the 200 1M sub-sequences, and so never save it, for example). In other words, results will be lower bounds that are incomplete, particularly at lower frequencies (values are complete only if a fragment is found and recorded in every sub-sequence). If you need an exact result, this is not suitable.
Upvotes: 1
Reputation: 13632
I would recommend using a dynamic programming algorithm. The problem is that for all inner
strings that are not found, you are re-searching those again with extra characters appended onto them, so of course those will also not be found. I do not have a specific algorithm in mind, but this is certainly a case for dynamic programming where you remember what you have already searched for. As a really crummy example, remember all substrings
of length 1,2,3,... that are not found, and never extend those bases in the next iteration where the strings are only longer.
Upvotes: 1