Reputation: 41
How can I write code to find the most frequent 2-mer of "GATCCAGATCCCCATAC"? I have written the code below, but it seems to be wrong. Please help me correct it.
def PatternCount(Pattern, Text):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count
This code prints the most frequent k-mer in a string, but it doesn't give me the 2-mer in the given string.
Upvotes: 4
Views: 3407
Reputation: 44615
If you want a simple approach, consider a sliding window technique. An implementation is available in more_itertools, so you don't have to write one yourself. It is easy to use once you run pip install more_itertools.
Simple Example
>>> from collections import Counter
>>> import more_itertools
>>> s = "GATCCAGATCCCCATAC"
>>> Counter(more_itertools.windowed(s, 2))
Counter({('A', 'C'): 1,
('A', 'G'): 1,
('A', 'T'): 3,
('C', 'A'): 2,
('C', 'C'): 4,
('G', 'A'): 2,
('T', 'A'): 1,
('T', 'C'): 2})
The above example demonstrates how little is required to get most of the information you want using windowed and Counter.
Description
A "window" or container of length k=2 slides across the sequence one stride at a time (e.g. step=1). Each new group is added as a key to the Counter dictionary. For each occurrence, the tally is incremented. The final Counter object primarily reports all tallies and includes other helpful features.
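The sliding described above can also be sketched with plain slicing, without more_itertools. This is only an illustration of the idea, not how windowed is implemented, and the helper name sliding_windows is hypothetical.

```python
from collections import Counter


def sliding_windows(seq, k=2):
    """Yield each length-k slice of seq, moving one position at a time."""
    # Hypothetical helper for illustration; not part of more_itertools.
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]


s = "GATCCAGATCCCCATAC"
tallies = Counter()
for window in sliding_windows(s, k=2):
    tallies[window] += 1  # each occurrence increments that window's tally

print(tallies["CC"])  # 4, matching the ('C', 'C') entry above
```

A sequence of length 17 yields 16 windows of length 2, so the tallies sum to 16.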
Final Solution
If actual string pairs are important, that is simple too. We will make a general function that joins each group into a string and works for any k-mer:
>>> from collections import Counter
>>> import more_itertools
>>> def count_mers(seq, k=1):
...     """Return a counter of adjacent mers."""
...     return Counter("".join(mers) for mers in more_itertools.windowed(seq, k))
>>> s = "GATCCAGATCCCCATAC"
>>> count_mers(s, k=2)
Counter({'AC': 1,
'AG': 1,
'AT': 3,
'CA': 2,
'CC': 4,
'GA': 2,
'TA': 1,
'TC': 2})
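If only the most frequent 2-mer is needed, the counter's most_common method can pull it out. The sketch below uses plain slicing instead of more_itertools so that it runs without extra installs. The name most_frequent_mers is hypothetical, and returning every k-mer tied for the top count is a design choice made here, not something the question requires.

```python
from collections import Counter


def most_frequent_mers(seq, k=2):
    """Return (mers, count): all k-mers tied for the highest count."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        return [], 0  # seq is shorter than k, so there are no k-mers
    top = counts.most_common(1)[0][1]  # the highest tally
    return sorted(m for m, c in counts.items() if c == top), top


print(most_frequent_mers("GATCCAGATCCCCATAC", k=2))  # (['CC'], 4)
```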
Upvotes: 3
Reputation: 5921
You can first define a function to get all the k-mers in your string:
def get_all_k_mer(string, k=1):
    length = len(string)
    # range (rather than Python 2's xrange) so this also runs on Python 3
    return [string[i:i + k] for i in range(length - k + 1)]
Then you can use collections.Counter to count the occurrences of each k-mer:
>>> from collections import Counter
>>> s = 'GATCCAGATCCCCATAC'
>>> Counter(get_all_k_mer(s, k=2))
Output:
Counter({'AC': 1,
'AG': 1,
'AT': 3,
'CA': 2,
'CC': 4,
'GA': 2,
'TA': 1,
'TC': 2})
Another example:
>>> s = "AAAAAA"
>>> Counter(get_all_k_mer(s, k=3))
Output:
Counter({'AAA': 4})
# Indeed : AAAAAA
#          ^^^     -> 1st time
#           ^^^    -> 2nd time
#            ^^^   -> 3rd time
#             ^^^  -> 4th time
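The overlap shown above is why the built-in str.count is not a substitute here: it counts non-overlapping matches only. A small comparison, using a hypothetical helper named overlapping_count:

```python
def overlapping_count(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


s = "AAAAAA"
print(overlapping_count(s, "AAA"))  # 4: windows start at positions 0, 1, 2, 3
print(s.count("AAA"))               # 2: str.count resumes after each match
```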
Upvotes: 7
Reputation: 61063
In general, when I want to count things in Python, I use a Counter:
from itertools import tee
from collections import Counter
dna = "GATCCAGATCCCCATAC"
a, b = tee(iter(dna), 2)
_ = next(b)
c = Counter(''.join(l) for l in zip(a,b))
print(c.most_common(1))
This prints [('CC', 4)], a list of the 1 most common 2-mers, each as a tuple with its count in the string.
In fact, we can generalize this to find the most common n-mer for a given n.
from itertools import tee, islice
from collections import Counter
def nmer(dna, n):
    iters = tee(iter(dna), n)
    iters = [islice(it, i, None) for i, it in enumerate(iters)]
    c = Counter(''.join(l) for l in zip(*iters))
    return c.most_common(1)
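A quick check of this generalization on the sequence from the question. The function is repeated, lightly condensed, so that the snippet runs on its own. zip stops at the shortest iterator, which is what keeps every joined group exactly n characters long.

```python
from collections import Counter
from itertools import islice, tee


def nmer(dna, n):
    # n independent iterators over dna, the i-th one advanced by i characters
    iters = [islice(it, i, None) for i, it in enumerate(tee(dna, n))]
    counts = Counter("".join(chars) for chars in zip(*iters))
    return counts.most_common(1)


print(nmer("GATCCAGATCCCCATAC", 2))  # [('CC', 4)]
print(nmer("AAAAAA", 3))             # [('AAA', 4)]
```

Note that most_common(1) returns a single entry even when several n-mers share the top count.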
Upvotes: 3