Reputation: 200
I am trying to understand a short piece of code that solves the clump finding problem for a DNA sequence. The problem statement is:
Given integers L and t, a string Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which Pattern appears at least t times.
For example, TGCA forms a (25,3)-clump in the following Genome:
gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac
Clump Finding Problem
Find patterns forming clumps in a string.
Given: A string Genome, and integers k, L, and t.
Return: All distinct k-mers forming (L, t)-clumps in Genome.
And the code is below:
from collections import defaultdict

def search(inseq, k, L, t):
    lookup = defaultdict(list)
    result = set()
    for cursor in range(len(inseq) - k + 1):
        seg = inseq[cursor:cursor + k]
        # remove prior positions of the same segment
        # if they are more than L distance far
        while lookup[seg] and cursor + k - lookup[seg][0] > L:
            lookup[seg].pop(0)
        lookup[seg].append(cursor)
        if len(lookup[seg]) == t:
            result.add(seg)
    return result
Here are my questions,
(1) What is the purpose of using defaultdict instead of dict?
(2) What is a lookup[seg]? Is it the starting position of the k-mer clump?
Upvotes: 0
Views: 2257
Reputation: 76297
(1) What is the purpose of using defaultdict?
defaultdict(list) lets you access a key with lookup[seg] and "magically" find a ready-made list there. If the key (seg) is already present, you get its existing list. Otherwise, an empty list is created, stored under that key, and returned. With a normal dict, accessing a missing key this way raises a KeyError.
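To make the difference from a normal dict concrete, here is a minimal sketch. The key "TGCA" and the positions 18 and 23 are just illustrative values, and the names plain and plain_error are mine, not from the question's code.

```python
from collections import defaultdict

# A plain dict raises KeyError when you index a key that was never set.
plain = {}
try:
    plain["TGCA"].append(18)
    plain_error = False
except KeyError:
    plain_error = True

# defaultdict(list) calls list() for a missing key, stores the new empty
# list under that key, and returns it, so .append() works on first access.
lookup = defaultdict(list)
lookup["TGCA"].append(18)
lookup["TGCA"].append(23)

print(plain_error)      # True
print(lookup["TGCA"])   # [18, 23]
```

One side effect worth knowing: merely reading lookup[seg] for an unseen seg adds that key to the dictionary, which is why the while condition in the question's code can safely test lookup[seg] before anything was appended.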
(2) What is a lookup[seg]?
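It is a list of the start positions at which seg has occurred in the sequence so far, keeping only those close enough to the current position that they still fit, together with the current occurrence, inside a window of length L. It is not a single starting position of a clump.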
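One way to see this is to replay the same bookkeeping and record the list each time a chosen k-mer is seen. The sketch below is only an illustration: trace_positions and history are hypothetical names I introduced, and the genome is the example from the question converted to upper case so the lower-case letters compare equal.

```python
from collections import defaultdict

def trace_positions(inseq, k, L, pattern):
    """Replay the loop from search() and return a snapshot of
    lookup[pattern] after each occurrence of pattern."""
    lookup = defaultdict(list)
    history = []
    for cursor in range(len(inseq) - k + 1):
        seg = inseq[cursor:cursor + k]
        # Drop old start positions that no longer fit in a length-L window
        # ending at the end of the current k-mer.
        while lookup[seg] and cursor + k - lookup[seg][0] > L:
            lookup[seg].pop(0)
        lookup[seg].append(cursor)
        if seg == pattern:
            history.append(list(lookup[seg]))
    return history

genome = "gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac".upper()

history = trace_positions(genome, 4, 25, "TGCA")
print(history)   # [[18], [18, 23], [18, 23, 36]]

# With a smaller window, position 18 is dropped before 36 is appended.
print(trace_positions(genome, 4, 20, "TGCA"))   # [[18], [18, 23], [23, 36]]
```

With L = 25 the list reaches length 3, which is the moment search() would add TGCA to the result for t = 3. With L = 20 it never gets past length 2.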
Upvotes: 1
Reputation: 1849
defaultdict is a Python dictionary subclass that returns a 'default' object (and stores it under the key) if you request a key that is not in the dictionary. In this case, the default item is an empty list. See the documentation for collections.defaultdict.
It appears that lookup[seg] returns a list of the positions of segment seg, restricted to those within distance L of the part of the sequence currently being parsed. So the object returned by lookup[seg] is a list of indices into your DNA sequence.
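If you want to convince yourself that this list-of-indices bookkeeping matches the problem definition, you can compare it against a direct check of every window. This is a rough sketch under my own assumptions: brute_force is a hypothetical helper written for this comparison, it is much slower than search(), and the two agree only when the genome is at least L characters long.

```python
from collections import defaultdict

def search(inseq, k, L, t):
    # The function from the question, with indentation restored.
    lookup = defaultdict(list)
    result = set()
    for cursor in range(len(inseq) - k + 1):
        seg = inseq[cursor:cursor + k]
        while lookup[seg] and cursor + k - lookup[seg][0] > L:
            lookup[seg].pop(0)
        lookup[seg].append(cursor)
        if len(lookup[seg]) == t:
            result.add(seg)
    return result

def brute_force(inseq, k, L, t):
    """Apply the definition literally: slide a window of length L over the
    sequence and count every k-mer inside each window."""
    result = set()
    for start in range(len(inseq) - L + 1):
        window = inseq[start:start + L]
        counts = defaultdict(int)
        for i in range(L - k + 1):
            counts[window[i:i + k]] += 1
        result.update(kmer for kmer, n in counts.items() if n >= t)
    return result

genome = "gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac".upper()
fast = search(genome, 4, 25, 3)
slow = brute_force(genome, 4, 25, 3)
print("TGCA" in fast, fast == slow)
```

Here defaultdict(int) plays the same role as defaultdict(list) does in search(): a missing k-mer starts at 0 instead of raising a KeyError.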
Upvotes: 1