python_idiot
python_idiot

Reputation: 43

calculate all possible combinations of RNA codons for a protein sequence

i have a protein sequence:

sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'

and a tRNA codon table for every protein:

codon_table = {
'A': ('GCT', 'GCC', 'GCA', 'GCG'),
'C': ('TGT', 'TGC'),
'D': ('GAT', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('TTT', 'TTC'),
'G': ('GGT', 'GGC', 'GGA', 'GGG'),
'H': ('CAT', 'CAC'),
'I': ('ATT', 'ATC', 'ATA'),
'K': ('AAA', 'AAG'),
'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
'M': ('ATG',),
'N': ('AAT', 'AAC'),
'P': ('CCT', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
'T': ('ACT', 'ACC', 'ACA', 'ACG'),
'V': ('GTT', 'GTC', 'GTA', 'GTG'),
'W': ('TGG',),
'Y': ('TAT', 'TAC'),}

i then wrote a function that would give a tuple with the possible codons for every protein:

tRNA = []
for i in sequence_protein:
    for residue in i:
        tRNA.append(codon_table[residue])

which gave this output:

[('ATT', 'ATC', 'ATA'),
 ('GAA', 'GAG'),
 ('GAA', 'GAG'),
 ('GCT', 'GCC', 'GCA', 'GCG'),
 ('ACT', 'ACC', 'ACA', 'ACG'),
 ('CAT', 'CAC'),
 ('ATG',),
 ('ACT', 'ACC', 'ACA', 'ACG'),
 ('CCT', 'CCC', 'CCA', 'CCG'),
 ('TGT', 'TGC'),
 ('TAT', 'TAC'),
 ('GAA', 'GAG'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
 ('CAT', 'CAC'),
 ('GGT', 'GGC', 'GGA', 'GGG'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
 ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
 ('TGG',),
 ('GTT', 'GTC', 'GTA', 'GTG'),
 ('CAA', 'CAG'),
 ('ATT', 'ATC', 'ATA'),
 ('CAA', 'CAG'),
 ('GAT', 'GAC'),
 ('TAT', 'TAC'),
 ('GCT', 'GCC', 'GCA', 'GCG'),
 ('ATT', 'ATC', 'ATA'),
 ('AAT', 'AAC'),
 ('GTT', 'GTC', 'GTA', 'GTG'),
 ('ATG',),
 ('CAA', 'CAG'),
 ('TGT', 'TGC'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG')]

is there a way to compute all possible codon combinations for the sequence (basically calculate the products for all the seperate elements in the tuple)? and also count the amount of products there would be without generating the sequences first?

i tried using the product function but that crashed my notebook :s

combs = []
for a in product(*tRNA):
    combs.append(a)
print(a)

Upvotes: 3

Views: 647

Answers (2)

aerijman
aerijman

Reputation: 2762

To calculate the total number of combinations:

sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'
total_number_combinations = np.prod([ len(codon_table[aa]) for aa in sequence_protein ])

To generate all possible combinations:

The most elegant is itertools:

from itertools import product

tRNA = [codon_table[aa] for aa in sequence_protein]
for i in product(*tRNA):
    #...do whatever you have to do with these combinations.

but you can use a custom function. Just use yield so that you don't generate all sequences at once and avoid memory problems.

Upvotes: 2

Synthaze
Synthaze

Reputation: 6090

import itertools

list_codons = [('ATT', 'ATC', 'ATA'),
 ('GAA', 'GAG'),
 ('GAA', 'GAG'),
 ('GCT', 'GCC', 'GCA', 'GCG'),
 ('ACT', 'ACC', 'ACA', 'ACG'),
 ('CAT', 'CAC'),
 ('ATG',),
 ('ACT', 'ACC', 'ACA', 'ACG'),
 ('CCT', 'CCC', 'CCA', 'CCG'),
 ('TGT', 'TGC'),
 ('TAT', 'TAC'),
 ('GAA', 'GAG'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
 ('CAT', 'CAC'),
 ('GGT', 'GGC', 'GGA', 'GGG'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
 ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
 ('TGG',),
 ('GTT', 'GTC', 'GTA', 'GTG'),
 ('CAA', 'CAG'),
 ('ATT', 'ATC', 'ATA'),
 ('CAA', 'CAG'),
 ('GAT', 'GAC'),
 ('TAT', 'TAC'),
 ('GCT', 'GCC', 'GCA', 'GCG'),
 ('ATT', 'ATC', 'ATA'),
 ('AAT', 'AAC'),
 ('GTT', 'GTC', 'GTA', 'GTG'),
 ('ATG',),
 ('CAA', 'CAG'),
 ('TGT', 'TGC'),
 ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG')]

counter = 0; max_proc = 1000000; list_seq = []

for x in itertools.product(*list_codons):
    counter += 1
    if counter % max_proc == 0:
        #Do your stuff by slice and clear the list
        list_seq = []
    list_seq.append(x)
    print (counter)
    print (x)

And that's it, no more RAM problem

Upvotes: 1

Related Questions