nosense

Reputation: 200

Clump Finding in DNA sequence

I am trying to understand a short piece of code that solves the clump-finding problem in a DNA sequence. The problem statement is:

Given integers L and t, a string Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which Pattern appears at least t times.

For example, TGCA forms a (25,3)-clump in the following Genome: gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac.
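
As a sanity check on the definition, the example can be verified directly. This is only a minimal sketch: the name forms_clump is made up for illustration, matching is assumed to be case-insensitive (hence the upper() call), and Genome is assumed to be at least L characters long.

```python
def forms_clump(pattern, genome, L, t):
    """Return True if pattern occurs at least t times inside some length-L window."""
    k = len(pattern)
    # Start indices of every (possibly overlapping) occurrence of pattern.
    starts = [i for i in range(len(genome) - k + 1) if genome[i:i + k] == pattern]
    # t consecutive occurrences fit in a window of length L when the span from
    # the first start to the end of the last occurrence is at most L.
    return any(starts[i + t - 1] + k - starts[i] <= L
               for i in range(len(starts) - t + 1))

genome = "gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac".upper()
print(forms_clump("TGCA", genome, 25, 3))  # True
```

TGCA starts at indices 18, 23 and 36, so the three occurrences span 36 + 4 - 18 = 22 characters, which fits in a window of length 25.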

Clump Finding Problem

Find patterns forming clumps in a string.

Given: A string Genome, and integers k, L, and t.

Return: All distinct k-mers forming (L, t)-clumps in Genome.

And the code is below:

from collections import defaultdict

def search(inseq, k, L, t):
    lookup = defaultdict(list)
    result = set()
    
    for cursor in range(len(inseq) - k + 1):
        seg = inseq[cursor:cursor + k]
        
        # Drop earlier start positions of this k-mer that no longer fit,
        # together with the current occurrence, in a window of length L.
        while lookup[seg] and cursor + k - lookup[seg][0] > L:
            lookup[seg].pop(0)
        
        lookup[seg].append(cursor)
        if len(lookup[seg]) == t:
            result.add(seg)
    
    return result

Here are my questions:

(1) What is the purpose of using defaultdict instead of dict?

(2) What is lookup[seg]? Is it the starting position of the k-mer clump?

Upvotes: 0

Views: 2257

Answers (2)

Ami Tavory

Reputation: 76297

(1) What is the purpose of using defaultdict?

defaultdict(list) lets you access a key with lookup[seg] and "magically" find a ready-made list there. If the key (seg) is already present, you get the list stored under it. Otherwise, an empty list is created, stored under that key, and returned. With a plain dict, accessing a missing key raises a KeyError instead.
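
A small sketch of the difference (the key "ACGT" and the variable names are just placeholders for illustration):

```python
from collections import defaultdict

plain = {}
try:
    plain["ACGT"].append(0)      # key is missing, so this raises KeyError
except KeyError:
    missing_key_failed = True

lookup = defaultdict(list)
lookup["ACGT"].append(0)         # missing key: list() is called, stored, then appended to
lookup["ACGT"].append(7)         # existing key: the same list is reused
print(dict(lookup))              # {'ACGT': [0, 7]}

# The equivalent idiom with a plain dict is setdefault:
plain.setdefault("ACGT", []).append(0)
print(plain)                     # {'ACGT': [0]}
```

Note that merely reading lookup[seg], as the while condition in your code does, is enough to create the empty list for a k-mer that has not been seen before.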

(2) What is lookup[seg]?

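It is a list of the start positions (indices into the sequence) at which seg has occurred, restricted to those that are close enough to the current position: the while loop discards any position for which the span from that start to the end of the current occurrence exceeds L. So it is not a single starting position. The k-mer is added to the result once the list holds t positions.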
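
To see this concretely, here is a rough sketch that repeats the loop from the question but records lookup[seg] for one chosen k-mer. The function name trace and the watch parameter are hypothetical additions for illustration, and the genome is upper-cased on the assumption that case does not matter.

```python
from collections import defaultdict

def trace(inseq, k, L, watch):
    """Same windowing as search(), but snapshots lookup[watch] at each occurrence."""
    lookup = defaultdict(list)
    history = []
    for cursor in range(len(inseq) - k + 1):
        seg = inseq[cursor:cursor + k]
        # Discard start positions that no longer fit in a length-L window.
        while lookup[seg] and cursor + k - lookup[seg][0] > L:
            lookup[seg].pop(0)
        lookup[seg].append(cursor)
        if seg == watch:
            history.append(list(lookup[seg]))  # copy, since the list keeps changing
    return history

genome = "gatcagcataagggtcccTGCAaTGCAtgacaagccTGCAgttgttttac".upper()
for snapshot in trace(genome, 4, 25, "TGCA"):
    print(snapshot)
# [18]
# [18, 23]
# [18, 23, 36]

print(trace(genome, 4, 20, "TGCA")[-1])  # [23, 36]
```

With L = 25 all three positions survive, so the list reaches length t = 3 and TGCA is reported. With L = 20, position 18 is dropped when the occurrence at 36 is processed (40 - 18 = 22 > 20), so the list never reaches length 3.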

Upvotes: 1

Alex Alifimoff

Reputation: 1849

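defaultdict is a dict subclass that, when you look up a key that is not in the dictionary, calls a factory function to create a default value, stores it under that key, and returns it. In this case the factory is list, so the default value is an empty list. See the Python documentation for collections.defaultdict.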

It appears that lookup[seg] holds the start positions of the occurrences of segment seg that, together with the occurrence currently being parsed, fit within a window of length L. So the value of lookup[seg] is a list of indices into your DNA sequence.

Upvotes: 1
