Reputation: 3
I am trying to generate a nucleotide motif that codes for chosen amino acids. For example, histidine is coded by CAT and CAC; arginine by CGT, CGC, CGA, CGG, AGA and AGG. The pattern that covers all of these codons is:
1st position in codon - C or A
2nd position in codon - A or G
3rd position in codon - A, T, C or G
This rule covers the chosen amino acids (H and R), but also amino acids that I don't want (for example, AAA is lysine, AAT is asparagine...). So I need a pattern that matches only my chosen amino acids. In the case above it could be [C][A or G][T], which codes only for histidine and arginine and for no other amino acid.
I am trying to work out an algorithm that does this for any set of amino acids I choose (more than two). If no pattern exists for the full set, it should fall back to patterns for fewer amino acids (for example, if there is no pattern for 5 amino acids, it should find the patterns for four amino acids from the query). This final optimization problem is probably the hardest part. Any suggestions? Thanks a lot, and sorry for my poor English.
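For illustration only, here is a minimal brute-force sketch of the problem as stated. It is not taken from any answer in this thread. It assumes the standard genetic code in the DNA alphabet, and the names `encoded` and `best_patterns` are made up for this example. Because each codon position can allow only 15 different non-empty nucleotide sets, there are just 15^3 = 3375 candidate patterns, so checking all of them is cheap and also handles the fallback to fewer amino acids.

```python
from itertools import combinations, product

# Standard genetic code (NCBI translation table 1), DNA alphabet, base order TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}

# All 15 non-empty nucleotide sets that a single codon position may allow.
NUC_SETS = ["".join(s) for r in range(1, 5) for s in combinations("ACGT", r)]


def encoded(pattern):
    """Return the set of amino acids coded by a pattern such as ("C", "AG", "T")."""
    return {CODON_TABLE["".join(codon)] for codon in product(*pattern)}


def best_patterns(chosen):
    """Return (pattern, amino_acids) pairs that code only for chosen amino
    acids and cover as many of them as possible (hypothetical helper)."""
    chosen = set(chosen)
    best, best_size = [], 0
    for pattern in product(NUC_SETS, repeat=3):
        aas = encoded(pattern)
        if aas <= chosen:  # the pattern codes for nothing unwanted
            if len(aas) > best_size:
                best, best_size = [(pattern, aas)], len(aas)
            elif len(aas) == best_size:
                best.append((pattern, aas))
    return best


if __name__ == "__main__":
    for pattern, aas in best_patterns("HR"):
        print(pattern, sorted(aas))
```

If the returned amino acid sets are smaller than the query, no pattern exists for the full set and the result lists the patterns for the largest subsets that do work.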
Upvotes: 0
Views: 276
Reputation: 3
The problem was solved:
1) I made a library of all codons that can be combined from the codons of the amino acids of choice (e.g. Met and Trp are AUG and UGG, so the library of all combinations is [A/U][U/G][G]: AUG, AGG, UUG, UGG).
2) I created two lists: the first of "good" codons (AUG, UGG) and the second of "bad" codons (AGG, UUG).
3) For each nucleotide in each position, I calculated the number of "good" amino acids left if I remove that nucleotide (say, if I remove U in the second position I lose my AUG for methionine, so for U in the second position the number is 1).
4) Next, the "bad" codons were analyzed for the occurrence of each nucleotide in each position (for codons AGG and UUG I have 2x G in the third position, 1x G in the second, 1x U in the first, etc.).
5) After these steps I simply took the highest number from step 4) and looked it up in the list from step 3). If, say, the G in the third position can be removed without losing any good amino acids (not possible in our example, but it can be done in larger codon sets), I remove all codons with G in the third position, update the lists of "good" and "bad" codons, and go back to step 3).
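The steps above can be sketched roughly as follows. This is only one possible reading of the description, not the author's actual code. It assumes the standard genetic code in the DNA alphabet (T instead of U), that the next most frequent nucleotide is tried when the most frequent one cannot be removed, and that the loop stops when no "bad" codon is left. The name `greedy_pattern` is made up for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), DNA alphabet, base order TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}


def library(allowed):
    """Step 1: every codon that the per-position nucleotide sets can form."""
    return ["".join(c) for c in product(*(sorted(s) for s in allowed))]


def greedy_pattern(chosen):
    """Return three nucleotide sets coding only for `chosen`, or None if the
    greedy removal gets stuck (hypothetical helper)."""
    chosen = set(chosen)
    good = [c for c, aa in CODON_TABLE.items() if aa in chosen]
    allowed = [{c[i] for c in good} for i in range(3)]
    while True:
        # Step 2: codons in the library that code for unwanted amino acids.
        bad = [c for c in library(allowed) if CODON_TABLE[c] not in chosen]
        if not bad:
            return allowed
        # Step 4: how often each (position, nucleotide) occurs in bad codons.
        counts = Counter((i, c[i]) for c in bad for i in range(3))
        # Steps 3 and 5: try the most frequent first, and remove it only if
        # every chosen amino acid still has at least one codon afterwards.
        for (pos, nt), _ in counts.most_common():
            trial = [s - {nt} if i == pos else s for i, s in enumerate(allowed)]
            if chosen <= {CODON_TABLE[c] for c in library(trial)}:
                allowed = trial
                break
        else:
            return None  # no nucleotide can be removed without a loss


if __name__ == "__main__":
    print(greedy_pattern("HR"))  # histidine + arginine
    print(greedy_pattern("MW"))  # the Met/Trp example from the steps above
```

Being greedy, this sketch is not guaranteed to find a pattern whenever one exists, and it does not handle the fallback to fewer amino acids.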
Upvotes: 0
Reputation: 51501
I would do this in two steps. First, translate the nucleotide sequence into the amino acid sequence, using a mapping of codon to amino acid (CAT maps to H, CAC maps to H, CGT maps to R, CGC maps to R, etc.). Second, use the Boyer-Moore algorithm to search for specific amino acid sequences, or regular expressions if you need "wildcards" or groups of options.
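A minimal sketch of these two steps, assuming the standard genetic code, translation in reading frame 0, and Python's `re` module for the search step (for a fixed amino acid string, a plain `str.find` would do). The function names `translate` and `find_motif` are made up for this example.

```python
import re
from itertools import product

# Standard genetic code (NCBI translation table 1), DNA alphabet, base order TCAG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Step 1: translate a DNA string codon by codon, starting at index 0.
    Codons not in the table (e.g. containing N) become 'X'; a trailing
    partial codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def find_motif(dna, aa_regex):
    """Step 2: search the translated sequence with a regular expression.
    Returns (amino_acid_index, matched_text) pairs."""
    protein = translate(dna)
    return [(m.start(), m.group()) for m in re.finditer(aa_regex, protein)]


if __name__ == "__main__":
    print(translate("CATCGTAAA"))
    # Two consecutive residues that are each histidine or arginine.
    print(find_motif("AAACATCGTAAA", "[HR]{2}"))
```

Multiplying a returned amino acid index by 3 gives the position of the match in the original nucleotide sequence.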
Upvotes: 2
Reputation: 1
I am more of a PHP and JS developer, but the logic for the code remains the same. The image below is a chart that shows the amino acids and their codons.
http://wang.salk.edu/images/fig2.jpg
I would suggest you do something like this:
$all_aas = ['CGC' => 'R', 'CGA' => 'R', 'CGG' => 'R', 'AGA' => 'R', ....]; // define all amino acids
$chosen_aas = ['CAT' => 'H', 'CGA' => 'R']; // define the amino acids which you chose
$lesser_aas = ['CGT' => 'R', 'CGC' => 'R']; // define the amino acids which are less preferred
$final_aa_seq = ''; // final amino acid string
These are associative arrays, the counterpart of Python's dictionaries: basically sets of key-value pairs that map a codon to its one-letter amino acid code.
Now, whenever you get a nucleotide sequence, all you need to do is:
Run a for loop, taking a substring of three characters at each step.
Search for this substring in the $chosen_aas array; if it matches, append the found code to $final_aa_seq.
If it is not found in $chosen_aas, search for it in $lesser_aas; if it matches, append the found code to $final_aa_seq.
Run through the complete for loop and output the string.
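The loop described above could look roughly like this in Python. It is only a sketch: the two small tables are illustrative and use the standard one-letter codes, codons found in neither table are simply skipped (the steps above do not say what should happen to them), and the function name `translate_preferring` is made up for this example.

```python
# Illustrative tables only: codon -> one-letter amino acid code.
CHOSEN_AAS = {"CAT": "H", "CGA": "R"}  # codons of the chosen amino acids
LESSER_AAS = {"CGT": "R", "CGC": "R"}  # less preferred codons


def translate_preferring(dna, chosen, lesser):
    """Walk the sequence three characters at a time, look each codon up in
    `chosen` first and in `lesser` second, and collect the codes found."""
    final_aa_seq = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in chosen:
            final_aa_seq.append(chosen[codon])
        elif codon in lesser:
            final_aa_seq.append(lesser[codon])
        # Assumption: codons in neither table contribute nothing.
    return "".join(final_aa_seq)


if __name__ == "__main__":
    print(translate_preferring("CATCGTAAACGA", CHOSEN_AAS, LESSER_AAS))
```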
Let me know if you need any more logic for this.
Upvotes: 0