Find the most frequent k-mers with mismatches in a DNA string

Question

I am trying to solve finding the most frequent k-mers with mismatches in a string. The requirements are listed below:

Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.

Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)

Output: All most frequent k-mers with up to d mismatches in Text.

Here is an example:

Sample Input:

ACGTTGCATGTCGCATGATGCATGAGAGCT

4 1

Sample Output:

GATG ATGC ATGT

The simplest and most inefficient way is to list all of k-mers in the text and calculate their hamming_difference between each other and pick out patterns whose hamming_difference less than or equal with d, below is my code:

kmer = 4
in_genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
in_mistake = 1;
out_result = [];
mismatch_list = []

def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    else:
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

for i in xrange(len(in_genome)-kmer + 1):
    v = in_genome[i:i + kmer]
    out_result.append(v)

for i in xrange(len(out_result)):
    for j in xrange(i + 1, len(out_result)):
        if hamming_distance((out_result[i], out_result[j]) <= in_mistake:
            mismatch_list.append(out_result[i], out_result[j])
    print mismatch_list

When I ran the code it always showed "Invalid syntax" and pointed out the error is from the ":" of "if hamming_distance((out_result[i], out_result[j]) <= in_mistake: ". It could have more errors even I figure out this one. Does anyone know what happened?

Another thing is I know there definitely exists better algorithms. Any clues?

Prune · Accepted Answer

I'm not sure what you're trying to do with mismatch_list: you can't append two things at once. Do you need extend, perhaps, to add each of them? The extra left parenthesis in line cited is easy to fix; this one requires a little more semantic work. replacing those lines with

if hamming_distance(out_result[i], out_result[j]) <= in_mistake:
    mismatch_list.extend([out_result[i], out_result[j]])

Gives this output (mismatch list):

['ACGT', 'ATGT', 'GTTG', 'GTCG', 'GTTG', 'GATG', 'TTGC', 'TCGC', 'TTGC', 'ATGC', 'TGCA', 'CGCA', 'TGCA', 'TGCA', 'GCAT', 'GCAT', 'GCAT', 'GCAT', 'CATG', 'CATG', 'CATG', 'GATG', 'CATG', 'CATG', 'ATGT', 'ATGA', 'ATGT', 'ATGC', 'ATGT', 'ATGA', 'CGCA', 'TGCA', 'GCAT', 'GCAT', 'CATG', 'GATG', 'CATG', 'CATG', 'ATGA', 'ATGC', 'ATGA', 'ATGA', 'TGAT', 'TGAG', 'GATG', 'CATG', 'ATGC', 'ATGA', 'TGAG', 'AGAG', 'GAGA', 'GAGC']

I'm not sure what you want from here; you specified d=1, but the items you printed have 4 errors each. Here is some code that counts the occurrences in mismatch_list, sorts the results by frequency (not used, but you may want it later), and then prints all those that match the assumed frequency of 4.

mismatch_count = collections.Counter(mismatch_list)
sorted_match = sorted(list(mismatch_count), key=lambda pair: pair[1], reverse=True)
for item, value in mismatch_count.iteritems():
    if value == 4:
        print item

Find the most frequent k-mers with mismatches in a DNA string

Answers (1)

Related Questions