Reputation: 59
I am trying to make a probability matrix of this sequence:
'DITCCGQFHFAIIYHDWQYKIFRYAATSPVKEPWKHRMWYSIVAANDVENCNSFHGPYQQ KHQWQDNTAQYLEYKTIGYQKRDQPNNVWIHHPMVYYEPVHYRQFNDRQAFTYSDQFCSK SCTIIWNGEANQCHNKQTASDHTGWPRMFAYLKENYTQYSTFFICMLDKYTCSNMKSLPE MHWELMEWALMCSCEKERARYQCNSWRKSIADPEFNYCIAWMFCKHEEKGEETRCEQKHQ ALLPPHEDYGDSLNDCQVNNGEPYTTKGEQRVKLQKEGHKNEQCRKATKRKYQASQCEAK REMMKNWRSYTATESNARVMQHWRQWRLHSMCVITDDHTQRRETCEAKENRMLRTALHIW VVWASHWFPVMNITQIWTGEDHGDHNSFLALCDSVVASYRILEQQLECCPNEDQCPMSIF HYKVKMCWEWRIVYAPNQSHTRNCALDFKKMEPIAGMMHCPGMQSGMLTSDRPVLEPGSV ENPLFDNHVRFSYFFEQVNNGKFMLECSTCGDNEEIFGYHCIVQNYQDCASAKSAIFCFM FANQHAERGWSPGLIVRNF'
A protein of amino acid sequences.
With as alphabet:
alphabet = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L','M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')
I created an empty matrix:
prob_matrix = {}
for i in alphabet:
prob_matrix[i] = {}
for j in alphabet:
prob_matrix[i][j] = 0.0
But I am struggeling to fill in this matrix with numbers based on my sequence. Can anyone help me with this formula?
Afterwards I can convert this to probabilities with this funtion:
for row in prob_matrix:
total = sum([prob_matrix[row][column] for column in prob_matrix[row]])
if total > 0:
for column in prob_matrix[row]:
prob_matrix[row][column] /= total
If that is correct.
Can anyone help me with the step inbetween? Or help me with creating a whole new formula?
Upvotes: 1
Views: 264
Reputation: 26325
I would probably start off by stripping the whitespace from your sequence string:
sequence = "DITCCGQFHFAIIYHDWQYKIFRYAATSPVKEPWKHRMWYSIVAANDVENCNSFHGPYQQ KHQWQDNTAQYLEYKTIGYQKRDQPNNVWIHHPMVYYEPVHYRQFNDRQAFTYSDQFCSK SCTIIWNGEANQCHNKQTASDHTGWPRMFAYLKENYTQYSTFFICMLDKYTCSNMKSLPE MHWELMEWALMCSCEKERARYQCNSWRKSIADPEFNYCIAWMFCKHEEKGEETRCEQKHQ ALLPPHEDYGDSLNDCQVNNGEPYTTKGEQRVKLQKEGHKNEQCRKATKRKYQASQCEAK REMMKNWRSYTATESNARVMQHWRQWRLHSMCVITDDHTQRRETCEAKENRMLRTALHIW VVWASHWFPVMNITQIWTGEDHGDHNSFLALCDSVVASYRILEQQLECCPNEDQCPMSIF HYKVKMCWEWRIVYAPNQSHTRNCALDFKKMEPIAGMMHCPGMQSGMLTSDRPVLEPGSV ENPLFDNHVRFSYFFEQVNNGKFMLECSTCGDNEEIFGYHCIVQNYQDCASAKSAIFCFM FANQHAERGWSPGLIVRNF"
stripped_sequence = sequence.replace(" ", "")
Then get the totals for each letter in the sequence using collections.Counter()
:
from collections import Counter
totals = Counter(stripped_sequence)
# Counter({'E': 41, 'Q': 37, 'N': 34, 'A': 33, 'K': 32, 'S': 32, 'C': 31, 'R': 29, 'H': 28, 'Y': 28, 'T': 26, 'F': 26, 'D': 24, 'I': 24, 'M': 24, 'V': 23, 'L': 23, 'P': 22, 'G': 21, 'W': 21})
And now you can get your probabilties by dividing the number of each sequence letter by the total number of sequence letters:
sequence_len = len(stripped_sequence)
probabilities = {
letter: totals[letter] / sequence_len for letter in alphabet
}
print(probabilities)
Which Outputs the following probablities:
{'A': 0.059033989266547404, 'C': 0.055456171735241505, 'D': 0.04293381037567084, 'E': 0.07334525939177101, 'F': 0.046511627906976744, 'G': 0.03756708407871199, 'H': 0.05008944543828265, 'I': 0.04293381037567084, 'K': 0.057245080500894455, 'L': 0.04114490161001789, 'M': 0.04293381037567084, 'N': 0.06082289803220036, 'P': 0.03935599284436494, 'Q': 0.06618962432915922, 'R': 0.0518783542039356, 'S': 0.057245080500894455, 'T': 0.046511627906976744, 'V': 0.04114490161001789, 'W': 0.03756708407871199, 'Y': 0.05008944543828265}
Which should sum near to 1.
Upvotes: 2
Reputation: 10979
This should get you the transition frequencies, which you can then convert to probabilities:
for i, j in zip(sequence[:-1], sequence[1:]):
prob_matrix[i][j] += 1
zip(stripped_sequence[:-1], stripped_sequence[1:])
generates a list of pairs of amino acids representing the transitions, e.g. [('D', 'I'), ('I', 'T'), ...]
. It works by pairing up the amino acids in the sequence missing the last amino acid, with the sequence missing the first amino acid.
Upvotes: 2