Reputation: 43
I have a protein sequence:
sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'
and a codon table listing the possible codons for every amino acid:
codon_table = {
'A': ('GCT', 'GCC', 'GCA', 'GCG'),
'C': ('TGT', 'TGC'),
'D': ('GAT', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('TTT', 'TTC'),
'G': ('GGT', 'GGC', 'GGA', 'GGG'),
'H': ('CAT', 'CAC'),
'I': ('ATT', 'ATC', 'ATA'),
'K': ('AAA', 'AAG'),
'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
'M': ('ATG',),
'N': ('AAT', 'AAC'),
'P': ('CCT', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
'T': ('ACT', 'ACC', 'ACA', 'ACG'),
'V': ('GTT', 'GTC', 'GTA', 'GTG'),
'W': ('TGG',),
'Y': ('TAT', 'TAC'),}
I then wrote a loop that collects the tuple of possible codons for every residue in the sequence:
tRNA = []
for residue in sequence_protein:
    tRNA.append(codon_table[residue])
which gave this output:
[('ATT', 'ATC', 'ATA'),
('GAA', 'GAG'),
('GAA', 'GAG'),
('GCT', 'GCC', 'GCA', 'GCG'),
('ACT', 'ACC', 'ACA', 'ACG'),
('CAT', 'CAC'),
('ATG',),
('ACT', 'ACC', 'ACA', 'ACG'),
('CCT', 'CCC', 'CCA', 'CCG'),
('TGT', 'TGC'),
('TAT', 'TAC'),
('GAA', 'GAG'),
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
('CAT', 'CAC'),
('GGT', 'GGC', 'GGA', 'GGG'),
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('TGG',),
('GTT', 'GTC', 'GTA', 'GTG'),
('CAA', 'CAG'),
('ATT', 'ATC', 'ATA'),
('CAA', 'CAG'),
('GAT', 'GAC'),
('TAT', 'TAC'),
('GCT', 'GCC', 'GCA', 'GCG'),
('ATT', 'ATC', 'ATA'),
('AAT', 'AAC'),
('GTT', 'GTC', 'GTA', 'GTG'),
('ATG',),
('CAA', 'CAG'),
('TGT', 'TGC'),
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG')]
Is there a way to compute all possible codon combinations for the sequence (basically the Cartesian product of all the separate tuples in the list)? And is there a way to count how many combinations there would be without generating the sequences first?
I tried using the product function from itertools, but that crashed my notebook :s
from itertools import product
combs = []
for a in product(*tRNA):
    combs.append(a)
    print(a)
Upvotes: 3
Views: 647
Reputation: 2762
import numpy as np
sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'
# Multiply the number of codon choices at each position.
total_number_combinations = np.prod([len(codon_table[aa]) for aa in sequence_protein])
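As a hedged alternative to np.prod: the same count can be computed with math.prod from the standard library (Python 3.8+). It works on Python's arbitrary-precision integers, so it cannot overflow, whereas np.prod can on platforms where NumPy's default integer is 32-bit. The sketch below copies the codon table from the question; the variable name total is only illustrative.

```python
import math

# Codon table copied from the question (amino acid -> synonymous codons).
codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'), 'C': ('TGT', 'TGC'),
    'D': ('GAT', 'GAC'), 'E': ('GAA', 'GAG'),
    'F': ('TTT', 'TTC'), 'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'H': ('CAT', 'CAC'), 'I': ('ATT', 'ATC', 'ATA'),
    'K': ('AAA', 'AAG'), 'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'M': ('ATG',), 'N': ('AAT', 'AAC'),
    'P': ('CCT', 'CCC', 'CCA', 'CCG'), 'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'), 'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',), 'Y': ('TAT', 'TAC'),
}

sequence_protein = 'IEEATHMTPCYELHGLRWVQIQDYAINVMQCL'

# The number of combinations is the product of the number of codon
# choices at each position; nothing is generated to obtain it.
total = math.prod(len(codon_table[aa]) for aa in sequence_protein)
print(total)
```

For this sequence the count comes to 2^34 * 3^7, roughly 3.8 x 10^13 combinations, which explains why collecting them all in a list exhausts memory.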
The most elegant way to iterate over the combinations is itertools:
from itertools import product
tRNA = [codon_table[aa] for aa in sequence_protein]
for combination in product(*tRNA):
    pass  # ...do whatever you have to do with this combination
You can also write a custom function instead. Just use yield so that you don't generate all sequences at once, which avoids the memory problems.
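A minimal sketch of what such a custom function with yield could look like. The name iter_combinations and the toy input are hypothetical and only for illustration; the generator yields combinations in the same order as itertools.product and holds only one partial combination in memory at a time.

```python
from itertools import islice

def iter_combinations(codon_lists, prefix=()):
    """Lazily yield every codon combination as a tuple, one at a time."""
    if not codon_lists:
        # No positions left: the prefix is a complete combination.
        yield prefix
        return
    first, rest = codon_lists[0], codon_lists[1:]
    for codon in first:
        # Fix one codon for this position and recurse on the remainder.
        yield from iter_combinations(rest, prefix + (codon,))

# Toy input (assumed for illustration): codon tuples for the peptide 'MCI'.
toy = [('ATG',), ('TGT', 'TGC'), ('ATT', 'ATC', 'ATA')]

for combo in iter_combinations(toy):
    print(''.join(combo))

# islice takes only the first few results without generating the rest.
first_two = list(islice(iter_combinations(toy), 2))
```

The recursion depth equals the sequence length, so a 32-residue protein is well within Python's default recursion limit.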
Upvotes: 2
Reputation: 6090
import itertools
list_codons = [('ATT', 'ATC', 'ATA'),
('GAA', 'GAG'),
('GAA', 'GAG'),
('GCT', 'GCC', 'GCA', 'GCG'),
('ACT', 'ACC', 'ACA', 'ACG'),
('CAT', 'CAC'),
('ATG',),
('ACT', 'ACC', 'ACA', 'ACG'),
('CCT', 'CCC', 'CCA', 'CCG'),
('TGT', 'TGC'),
('TAT', 'TAC'),
('GAA', 'GAG'),
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
('CAT', 'CAC'),
('GGT', 'GGC', 'GGA', 'GGG'),
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('TGG',),
('GTT', 'GTC', 'GTA', 'GTG'),
('CAA', 'CAG'),
('ATT', 'ATC', 'ATA'),
('CAA', 'CAG'),
('GAT', 'GAC'),
('TAT', 'TAC'),
('GCT', 'GCC', 'GCA', 'GCG'),
('ATT', 'ATC', 'ATA'),
('AAT', 'AAC'),
('GTT', 'GTC', 'GTA', 'GTG'),
('ATG',),
('CAA', 'CAG'),
('TGT', 'TGC'),
('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG')]
counter = 0
max_proc = 1000000
list_seq = []
for x in itertools.product(*list_codons):
    counter += 1
    if counter % max_proc == 0:
        # Do your stuff by slice and clear the list
        list_seq = []
    list_seq.append(x)
# After the loop: total number of combinations and the last one generated
print(counter)
print(x)
And that's it, no more RAM problem: at most max_proc combinations are held in memory at any one time.
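The same slicing idea can be sketched with itertools.islice, which also hands over the final, partially filled batch (in the loop above, whatever is left in list_seq after the last multiple of max_proc still has to be processed separately). The helper name batched_product and the toy input are assumptions for illustration.

```python
from itertools import islice, product

def batched_product(codon_lists, batch_size):
    """Yield lists of at most batch_size combinations, built lazily."""
    combos = product(*codon_lists)
    while True:
        # Pull the next batch_size combinations from the lazy iterator.
        batch = list(islice(combos, batch_size))
        if not batch:
            return  # iterator exhausted
        yield batch

# Toy input (assumed for illustration): 1 * 2 * 3 = 6 combinations.
toy = [('ATG',), ('TGT', 'TGC'), ('ATT', 'ATC', 'ATA')]

batch_sizes = [len(batch) for batch in batched_product(toy, batch_size=4)]
print(batch_sizes)  # [4, 2]
```

Batching only bounds memory use. With tens of trillions of combinations for the full sequence, the running time of visiting every one of them is still likely to be the practical limit.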
Upvotes: 1