Reputation: 200
I am trying to solve finding the most frequent k-mers with mismatches in a string. The requirements are listed below:
Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.
Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)
Output: All most frequent k-mers with up to d mismatches in Text.
Here is an example:
Sample Input:
ACGTTGCATGTCGCATGATGCATGAGAGCT
4 1
Sample Output:
GATG ATGC ATGT
The simplest and most inefficient way is to list all of k-mers in the text and calculate their hamming_difference between each other and pick out patterns whose hamming_difference less than or equal with d, below is my code:
kmer = 4
in_genome = "ACGTTGCATGTCGCATGATGCATGAGAGCT";
in_mistake = 1;
out_result = [];
mismatch_list = []
def hamming_distance(s1, s2):
# Return the Hamming distance between equal-length sequences
if len(s1) != len(s2):
raise ValueError("Undefined for sequences of unequal length")
else:
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
for i in xrange(len(in_genome)-kmer + 1):
v = in_genome[i:i + kmer]
out_result.append(v)
for i in xrange(len(out_result)):
for j in xrange(i + 1, len(out_result)):
if hamming_distance((out_result[i], out_result[j]) <= in_mistake:
mismatch_list.append(out_result[i], out_result[j])
print mismatch_list
When I ran the code it always showed "Invalid syntax" and pointed out the error is from the ":" of "if hamming_distance((out_result[i], out_result[j]) <= in_mistake: ". It could have more errors even I figure out this one. Does anyone know what happened?
Another thing is I know there definitely exists better algorithms. Any clues?
Upvotes: 2
Views: 3874
Reputation: 77837
I'm not sure what you're trying to do with mismatch_list: you can't append two things at once. Do you need extend, perhaps, to add each of them? The extra left parenthesis in line cited is easy to fix; this one requires a little more semantic work. replacing those lines with
if hamming_distance(out_result[i], out_result[j]) <= in_mistake:
mismatch_list.extend([out_result[i], out_result[j]])
Gives this output (mismatch list):
['ACGT', 'ATGT', 'GTTG', 'GTCG', 'GTTG', 'GATG', 'TTGC', 'TCGC', 'TTGC', 'ATGC', 'TGCA', 'CGCA', 'TGCA', 'TGCA', 'GCAT', 'GCAT', 'GCAT', 'GCAT', 'CATG', 'CATG', 'CATG', 'GATG', 'CATG', 'CATG', 'ATGT', 'ATGA', 'ATGT', 'ATGC', 'ATGT', 'ATGA', 'CGCA', 'TGCA', 'GCAT', 'GCAT', 'CATG', 'GATG', 'CATG', 'CATG', 'ATGA', 'ATGC', 'ATGA', 'ATGA', 'TGAT', 'TGAG', 'GATG', 'CATG', 'ATGC', 'ATGA', 'TGAG', 'AGAG', 'GAGA', 'GAGC']
I'm not sure what you want from here; you specified d=1, but the items you printed have 4 errors each. Here is some code that counts the occurrences in mismatch_list, sorts the results by frequency (not used, but you may want it later), and then prints all those that match the assumed frequency of 4.
mismatch_count = collections.Counter(mismatch_list)
sorted_match = sorted(list(mismatch_count), key=lambda pair: pair[1], reverse=True)
for item, value in mismatch_count.iteritems():
if value == 4:
print item
Upvotes: 2