Find the most frequent k-mers with mismatches in a text

Question

I am trying to solve finding the most frequent k-mers with mismatches in a string. The requirements are listed below:

Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.

Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)

Output: All most frequent k-mers with up to d mismatches in Text.

Here is an example:

Sample Input:

ACGTTGCATGTCGCATGATGCATGAGAGCT

4 1

Sample Output:

GATG ATGC ATGT

The simplest and most inefficient way is to list all of k-mers in the text and calculate their hamming_difference between each other and pick out patterns whose hamming_difference less than or equal with d, below is my code:

import collections

kmer = 4
in_genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
in_mistake = 1;
out_result = [];
mismatch_list = []

def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    else:
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

for i in xrange(len(in_genome)-kmer + 1):
    v = in_genome[i:i + kmer]
    out_result.append(v)

for i in xrange(len(out_result) - 1):
    for j in xrange(i+1, len(out_result)):
        if hamming_distance(str(out_result[i]), str(out_result[j])) <= in_mistake:
            mismatch_list.extend([out_result[i], out_result[j]])

mismatch_count = collections.Counter(mismatch_list)
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]

Instead of the expected results, I got 'CATG'. Does anyone know something wrong with my code?

cdlane · Accepted Answer

It all seems great until your last line of code:

print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]

Since CATG scored higher than any other kmer, you'll only ever get that one answer. Take a look at:

>>> print mismatch_count.most_common()
[('CATG', 9), ('ATGA', 6), ('GCAT', 6), ('ATGC', 4), ('TGCA', 4), ('ATGT', 4), ('GATG', 4), ('GTTG', 2), ('TGAG', 2), ('TTGC', 2), ('CGCA', 2), ('TGAT', 1), ('GTCG', 1), ('AGAG', 1), ('ACGT', 1), ('TCGC', 1), ('GAGC', 1), ('GAGA', 1)]

to figure out what it is you really want back from this result.

I believe the fix is to change your second top level 'for' loop to read as follows:

for t_kmer in set(out_result):
    for s_kmer in out_result:
        if hamming_distance(t_kmer, s_kmer) <= in_mistake:
            mismatch_list.append(t_kmer)

This produces a result similar to what you're expecting:

>>> print mismatch_count.most_common()
[('ATGC', 5), ('ATGT', 5), ('GATG', 5), ('CATG', 4), ('ATGA', 4), ('GTTG', 3), ('CGCA', 3), ('GCAT', 3), ('TGAG', 3), ('TTGC', 3), ('TGCA', 3), ('TGAT', 2), ('GTCG', 2), ('AGAG', 2), ('ACGT', 2), ('TCGC', 2), ('GAGA', 2), ('GAGC', 2), ('TGTC', 1), ('CGTT', 1), ('AGCT', 1)]

Find the most frequent k-mers with mismatches in a text

Answers (1)

Related Questions