Reputation: 406
I have been trying to figure out what's wrong with this bioinformatics code for hours and I can't see it. The pieces of my function appear to work, but it misses certain patterns. I'm using a sliding window to count the number of times each combination of nucleotides of length k shows up in a piece of text.
The first function I need converts a nucleotide string to its index by reading the string as a quaternary (base-4) number:
nucs = {'A':0,'C':1,'G':2,'T':3}
def PatternToNumber(Pattern):
    index = 0
    power = []
    for i in range(len(Pattern)-1,-1,-1):
        power.append(i)
    for i in range(len(Pattern)):
        index += nucs[Pattern[i]]*(4**power[i])
    return index
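As a quick sanity check of the indexing, here is a compact sketch of the same conversion using Horner's rule (multiply the running index by 4, then add the next digit). The names `NUCS` and `pattern_to_number` are illustrative stand-ins, not part of the original code; the results should match `PatternToNumber` for any string over A, C, G, T.

```python
# Hypothetical names for illustration; same digit mapping as the question's nucs dict.
NUCS = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def pattern_to_number(pattern):
    """Read a nucleotide string as a base-4 number, most significant digit first."""
    index = 0
    for base in pattern:
        # Shift the digits seen so far one place left, then add the new digit.
        index = index * 4 + NUCS[base]
    return index

print(pattern_to_number('AA'))  # 0
print(pattern_to_number('CG'))  # 6  (1*4 + 2)
print(pattern_to_number('TT'))  # 15 (3*4 + 3)
```

For k = 2 the indices run from 0 ('AA') to 15 ('TT'), which is why the frequency array has 4**k = 16 entries.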
The next function I use slides a window of length k along a chunk of text and adds 1 to the matching index in a frequency array.
def ComputingFrequencies(Text,k):
    FrequencyArray = [0]*(4**k)
    for i in range(len(Text)-k):
        Pattern = Text[i:i+k]
        index = PatternToNumber(Pattern)
        FrequencyArray[index] += 1
    print(*FrequencyArray)
Like I said, I've looked at every line and each one seems to turn nucleotide patterns into index numbers the way I would expect, but the output you get from running:
ComputingFrequencies('ACGCGGCTCTGAAA',2)
is:
1 1 0 0 0 0 2 2 1 2 1 0 0 1 1 0
If you look at the first number in the FrequencyArray, it says the string 'AA' only shows up once, but the last three characters of the Text input are 'AAA', which means 'AA' shows up twice (the two occurrences overlap). So the first entry in the FrequencyArray should be 2 and not 1. What we should expect is:
2 1 0 0 0 0 2 2 1 2 1 0 0 1 1 0
If I've not explained it well, I can try to clarify my code a bit if needed.
Upvotes: 1
Views: 50
Reputation: 802
I'm pretty sure you just have an off-by-one error: this loop never checks the window that ends at the last character.
for i in range(len(Text)-k):
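A text of length n has n-k+1 windows of length k, but range(len(Text)-k) only visits the first n-k of them. For length 2, it will only iterate up to the first AA, so you're only seeing it once. Change it to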
for i in range(len(Text)-(k-1)):
And I think that should give you what you want.
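Here is a minimal, self-contained sketch of the whole thing with that one-line change applied, so it can be run against the example from the question. The snake_case names are illustrative stand-ins for `PatternToNumber` and `ComputingFrequencies`, and the function returns the list instead of printing it, which is an assumption made to keep it easy to check.

```python
# Hypothetical names for illustration; the logic follows the question's code.
NUCS = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def pattern_to_number(pattern):
    """Read a nucleotide string as a base-4 number, most significant digit first."""
    index = 0
    for base in pattern:
        index = index * 4 + NUCS[base]
    return index

def computing_frequencies(text, k):
    """Count every k-mer in text, including the window ending at the last character."""
    frequency_array = [0] * (4 ** k)
    # len(text) - k + 1 windows: start positions 0 through len(text) - k inclusive.
    for i in range(len(text) - k + 1):
        frequency_array[pattern_to_number(text[i:i + k])] += 1
    return frequency_array

print(*computing_frequencies('ACGCGGCTCTGAAA', 2))
# 2 1 0 0 0 0 2 2 1 2 1 0 0 1 1 0
```

A handy check is that the counts sum to len(Text) - k + 1 (13 here); with the original range they sum to 12, which shows that one window is being skipped.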
Upvotes: 1