Reputation: 200
I am trying to solve finding the most frequent k-mers with mismatches in a string. The requirements are listed below:
Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.
Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)
Output: All most frequent k-mers with up to d mismatches in Text.
Here is an example:
Sample Input:
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1
Sample Output:
GATG ATGC ATGT
The simplest and most inefficient way is to list all of k-mers in the text and calculate their hamming_difference between each other and pick out patterns whose hamming_difference less than or equal with d, below is my code:
import collections
kmer = 4
in_genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
in_mistake = 1;
out_result = [];
mismatch_list = []
def hamming_distance(s1, s2):
# Return the Hamming distance between equal-length sequences
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
else:
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
for i in xrange(len(in_genome)-kmer + 1):
v = in_genome[i:i + kmer]
out_result.append(v)
for i in xrange(len(out_result) - 1):
for j in xrange(i+1, len(out_result)):
if hamming_distance(str(out_result[i]), str(out_result[j])) <= in_mistake:
mismatch_list.extend([out_result[i], out_result[j]])
mismatch_count = collections.Counter(mismatch_list)
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]
Instead of the expected results, I got 'CATG'. Does anyone know something wrong with my code?
Upvotes: 0
Views: 2667
Reputation: 41872
It all seems great until your last line of code:
print [key for key,val in mismatch_count.iteritems() if val == max(mismatch_count.values())]
Since CATG scored higher than any other kmer, you'll only ever get that one answer. Take a look at:
>>> print mismatch_count.most_common()
[('CATG', 9), ('ATGA', 6), ('GCAT', 6), ('ATGC', 4), ('TGCA', 4), ('ATGT', 4), ('GATG', 4), ('GTTG', 2), ('TGAG', 2), ('TTGC', 2), ('CGCA', 2), ('TGAT', 1), ('GTCG', 1), ('AGAG', 1), ('ACGT', 1), ('TCGC', 1), ('GAGC', 1), ('GAGA', 1)]
to figure out what it is you really want back from this result.
I believe the fix is to change your second top level 'for' loop to read as follows:
for t_kmer in set(out_result):
for s_kmer in out_result:
if hamming_distance(t_kmer, s_kmer) <= in_mistake:
mismatch_list.append(t_kmer)
This produces a result similar to what you're expecting:
>>> print mismatch_count.most_common()
[('ATGC', 5), ('ATGT', 5), ('GATG', 5), ('CATG', 4), ('ATGA', 4), ('GTTG', 3), ('CGCA', 3), ('GCAT', 3), ('TGAG', 3), ('TTGC', 3), ('TGCA', 3), ('TGAT', 2), ('GTCG', 2), ('AGAG', 2), ('ACGT', 2), ('TCGC', 2), ('GAGA', 2), ('GAGC', 2), ('TGTC', 1), ('CGTT', 1), ('AGCT', 1)]
Upvotes: 1