Reputation: 59

probability matrix of a protein sequence

I am trying to make a probability matrix of this sequence:

'DITCCGQFHFAIIYHDWQYKIFRYAATSPVKEPWKHRMWYSIVAANDVENCNSFHGPYQQ KHQWQDNTAQYLEYKTIGYQKRDQPNNVWIHHPMVYYEPVHYRQFNDRQAFTYSDQFCSK SCTIIWNGEANQCHNKQTASDHTGWPRMFAYLKENYTQYSTFFICMLDKYTCSNMKSLPE MHWELMEWALMCSCEKERARYQCNSWRKSIADPEFNYCIAWMFCKHEEKGEETRCEQKHQ ALLPPHEDYGDSLNDCQVNNGEPYTTKGEQRVKLQKEGHKNEQCRKATKRKYQASQCEAK REMMKNWRSYTATESNARVMQHWRQWRLHSMCVITDDHTQRRETCEAKENRMLRTALHIW VVWASHWFPVMNITQIWTGEDHGDHNSFLALCDSVVASYRILEQQLECCPNEDQCPMSIF HYKVKMCWEWRIVYAPNQSHTRNCALDFKKMEPIAGMMHCPGMQSGMLTSDRPVLEPGSV ENPLFDNHVRFSYFFEQVNNGKFMLECSTCGDNEEIFGYHCIVQNYQDCASAKSAIFCFM FANQHAERGWSPGLIVRNF'

A protein of amino acid sequences.

With as alphabet:

alphabet = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L','M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')

I created an empty matrix:

prob_matrix = {}
for i in alphabet:
    prob_matrix[i] = {}
    for j in alphabet:
        prob_matrix[i][j] = 0.0

But I am struggeling to fill in this matrix with numbers based on my sequence. Can anyone help me with this formula?

Afterwards I can convert this to probabilities with this funtion:

for row in prob_matrix:
   total = sum([prob_matrix[row][column] for column in prob_matrix[row]])
   if total > 0:
       for column in prob_matrix[row]:
           prob_matrix[row][column] /= total

If that is correct.

Can anyone help me with the step inbetween? Or help me with creating a whole new formula?

Upvotes: 1

Answers (2)

RoadRunner

Reputation: 26325

I would probably start off by stripping the whitespace from your sequence string:

sequence = "DITCCGQFHFAIIYHDWQYKIFRYAATSPVKEPWKHRMWYSIVAANDVENCNSFHGPYQQ KHQWQDNTAQYLEYKTIGYQKRDQPNNVWIHHPMVYYEPVHYRQFNDRQAFTYSDQFCSK SCTIIWNGEANQCHNKQTASDHTGWPRMFAYLKENYTQYSTFFICMLDKYTCSNMKSLPE MHWELMEWALMCSCEKERARYQCNSWRKSIADPEFNYCIAWMFCKHEEKGEETRCEQKHQ ALLPPHEDYGDSLNDCQVNNGEPYTTKGEQRVKLQKEGHKNEQCRKATKRKYQASQCEAK REMMKNWRSYTATESNARVMQHWRQWRLHSMCVITDDHTQRRETCEAKENRMLRTALHIW VVWASHWFPVMNITQIWTGEDHGDHNSFLALCDSVVASYRILEQQLECCPNEDQCPMSIF HYKVKMCWEWRIVYAPNQSHTRNCALDFKKMEPIAGMMHCPGMQSGMLTSDRPVLEPGSV ENPLFDNHVRFSYFFEQVNNGKFMLECSTCGDNEEIFGYHCIVQNYQDCASAKSAIFCFM FANQHAERGWSPGLIVRNF"

stripped_sequence = sequence.replace(" ", "")

Then get the totals for each letter in the sequence using collections.Counter():

from collections import Counter

totals = Counter(stripped_sequence)
# Counter({'E': 41, 'Q': 37, 'N': 34, 'A': 33, 'K': 32, 'S': 32, 'C': 31, 'R': 29, 'H': 28, 'Y': 28, 'T': 26, 'F': 26, 'D': 24, 'I': 24, 'M': 24, 'V': 23, 'L': 23, 'P': 22, 'G': 21, 'W': 21})

And now you can get your probabilties by dividing the number of each sequence letter by the total number of sequence letters:

sequence_len = len(stripped_sequence)
probabilities = {
    letter: totals[letter] / sequence_len for letter in alphabet
}

print(probabilities)

Which Outputs the following probablities:

{'A': 0.059033989266547404, 'C': 0.055456171735241505, 'D': 0.04293381037567084, 'E': 0.07334525939177101, 'F': 0.046511627906976744, 'G': 0.03756708407871199, 'H': 0.05008944543828265, 'I': 0.04293381037567084, 'K': 0.057245080500894455, 'L': 0.04114490161001789, 'M': 0.04293381037567084, 'N': 0.06082289803220036, 'P': 0.03935599284436494, 'Q': 0.06618962432915922, 'R': 0.0518783542039356, 'S': 0.057245080500894455, 'T': 0.046511627906976744, 'V': 0.04114490161001789, 'W': 0.03756708407871199, 'Y': 0.05008944543828265}

Which should sum near to 1.

Upvotes: 2

Peter Collingridge

Reputation: 10979

This should get you the transition frequencies, which you can then convert to probabilities:

for i, j in zip(sequence[:-1], sequence[1:]):
    prob_matrix[i][j] += 1

zip(stripped_sequence[:-1], stripped_sequence[1:]) generates a list of pairs of amino acids representing the transitions, e.g. [('D', 'I'), ('I', 'T'), ...]. It works by pairing up the amino acids in the sequence missing the last amino acid, with the sequence missing the first amino acid.

Upvotes: 2

probability matrix of a protein sequence

Answers (2)

Related Questions