user4271388

Figure out proper match regex

I am relatively new to programming. I am currently learning Perl and I ran into a logical problem that’s preventing me from finishing the script correctly. Any help would be greatly appreciated!! Thank you in advance for your useful insights!

The bulk of the program has already been written; it is the last step that's giving me a headache.

I have a variable $RNA, which holds a sequence of nucleotides (a, c, g, u) in any order. For example:

my $RNA = "agcuaggaaggguuuugauag";

and so on.

I already created a hash where every three-nucleotide codon (e.g. ggg) is assigned to a defined amino acid. E.g.:

my %AminoAcid = (
  ggg => "G",
  ...
);

What I want to do is print the amino acids (the upper-case letters) that the hash defines for each codon, starting when the program reads the START codon aug within the $RNA string and stopping when it reads the STOP codon uga.

For example: suppose $RNA = aaaaugcccgggugaccccccccc. The program should print the corresponding amino acids starting from the start codon (aug) and stop when it reads the stop codon (uga) within the string.

NOTE: It should ignore the first three characters (aaa) before the START codon (aug) and the run of c's after the stop codon (uga), and it should repeat the same process if it sees the start codon aug again anywhere in the string.

I have tried multiple ideas and none of them came even close to depicting the proper way of writing the code for that last part. I probably don't fully get the logic behind it.

Any help would be greatly appreciated. Thanks in advance!!!

Upvotes: 1

Views: 177

Answers (3)

hepcat72

Reputation: 1094

@lucas-trzesniewski's sub is great and very compact, but it has some drawbacks: it doesn't handle interspersed hard returns, only finds the first protein, does not print the first Methionine, and modifies $_/$1 with an implicit return (which is something that I try to avoid). So here's an improvement. Note, I have a very tricked-out translation script of my own which handles a lot more cases (e.g. overlapping reading frames, ambiguous nucleotides, RNA fragments, alternate start codons, multiple stop codons, etc), but if you just want something simple for limited cases, here's a modified version of @lucas-trzesniewski's sub that addresses those issues:

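sub getAminoAcids
  {
    my ($mrna) = @_;      # list assignment; "my $mrna = @_" would store the argument count
    $mrna =~ s/\s+//sg;   # strip whitespace, including hard returns
    $mrna = lc($mrna);
    my @proteins = ();
    # find every aug...uga region in frame with its start codon, keeping the aug
    while($mrna =~ /(aug(?:[acgu]{3})*?)uga/g)
      {
        my $cds = $1;     # declared with "my" so the sub compiles under "use strict"
        push(@proteins,"");
        # translate the coding sequence one codon at a time
        while($cds =~ /(...)/g)
          {$proteins[-1] .= $aminoAcidMap{$1}}
      }
    return(@proteins);
  }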

This assumes you don't want to print a stop character in your protein string. It could also use extras such as error checking.
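As a rough check of the behaviour described above (several proteins, embedded line breaks, leading methionine kept), here is a small self-contained sketch. It carries its own strict-safe copy of the sub so that it runs on its own. The three-entry map is a toy following the thread's examples rather than the real genetic code, and replacing unknown codons with "?" is an assumption made for the demo.

```perl
use strict;
use warnings;

# Toy map following the thread's examples, not the real genetic code.
my %aminoAcidMap = ( aug => "M", ccc => "C", ggg => "G" );

sub getAminoAcids {
    my ($mrna) = @_;
    $mrna =~ s/\s+//g;      # drop whitespace, including line breaks
    $mrna = lc $mrna;
    my @proteins;
    # Each match is one start..stop region, read in whole codons.
    while ($mrna =~ /(aug(?:[acgu]{3})*?)uga/g) {
        my $cds = $1;
        # Unknown codons become "?" here (an assumption for the demo).
        push @proteins,
            join "", map { $aminoAcidMap{$_} // "?" } $cds =~ /(...)/g;
    }
    return @proteins;
}

# Hypothetical input: two coding regions, the first split by a newline.
my @proteins = getAminoAcids("aaaaugccc\ngggugacccaugggguga");
print join(",", @proteins), "\n";   # MCG,MG
```

Returning a list rather than printing inside the sub leaves it to the caller to decide how the proteins are displayed.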

Upvotes: 0

user557597

This might work: put the logic in a code group, (?{ }), inside the regex.

Modified for multiple lines.
Note: if there should ever need to be a realignment (by 3's), let me know.
As of now the alignment is 3 non-whitespace characters + optional whitespace, repeated.
This will consume the line breaks while maintaining the 3-character boundary, which I
assume is important.

Perl code

use strict;
use warnings;

my %AminoAcid   = (
   aug => "Start codon",
   ccc => "C",
   ggg => "G",
   uuu => "U"
);

my $RNA = '
aaaaugcccgggugacccgggcccgggcccaaaauguuugggcccugacccgggccccccaugccc
gggugacccgggcccgggcccaaaauguuugggcccugacccgggcccccc
aaaaugcccgggugacccgggcccgggcccaaaauguuugggcccugacccgggcccccc
aaaaugcccgggugacccgggcccgggcccaaaauguuugggcccugacccgggcccccc
';
my $on = 0;

$RNA =~ /
     (?:
          ( \S\S\S ) \s*
          (?{
               if ( $^N eq 'aug' ){ $on = 1; print "\n"; }
               elsif ( $^N eq 'uga' ) { $on = 0; }
               if ( $on ) {
                  exists $AminoAcid{ $^N } ?
                    print $AminoAcid{ $^N } :
                    print "[key not found-> '$^N']";
               }
          })
     )+
   /x;

Output

Start codonCG
Start codonUGC
Start codonCG
Start codonUGC
Start codonCG
Start codonUGC
Start codonCG
Start codonUGC

Upvotes: 1

Lucas Trzesniewski

Reputation: 51330

Let's start with this:

my $rna = "aaaaugcccgggugaccccccccc";
my %aminoAcidMap = ( ggg => "G", ccc => "C" );

The first step is to extract the relevant part between aug and uga:

$rna =~ /aug((?:[acgu]{3})*?)uga/ or die;
my $pattern = $1;

This assumes that aug can appear anywhere in the string. Also, it ensures the match doesn't stop at a uga that spans two codons.
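To illustrate the point about a uga that spans two codons, here is a minimal sketch comparing the codon-wise pattern with a naive `.*?`. The input string is made up for the demonstration: its codons are aug, cug, aaa, uga, so the letters "uga" also occur out of frame across cug and aaa.

```perl
use strict;
use warnings;

# Hypothetical input: codons are aug cug aaa uga, so "uga" also
# appears out of frame, straddling "cug" and "aaa".
my $rna = "augcugaaauga";

# Naive version: stops at the first "uga" regardless of frame.
my ($naive) = $rna =~ /aug(.*?)uga/;

# Frame-aware version: only consumes whole codons after aug.
my ($framed) = $rna =~ /aug((?:[acgu]{3})*?)uga/;

print "naive:  $naive\n";    # c
print "framed: $framed\n";   # cugaaa
```

The naive capture ends mid-codon, while the codon-wise group can only stop on a boundary that is a multiple of three characters past the start codon.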

If you require the start codon to be at an index in the string that is divisible by 3, you can do this instead:

$rna =~ /^(?:[acgu]{3})*?aug((?:[acgu]{3})*?)uga/ or die;
my $pattern = $1;

At this point, $pattern will contain the part between aug and uga.

Now, to do the mapping, you can do:

my $aminoAcids = $pattern =~ s/[acgu]{3}/$aminoAcidMap{$&}/ger;

This will replace each codon with the value from the hash.
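Put together, the two steps above can be run as one short script. This is only a sketch on the example input: the `/r` modifier needs Perl 5.14 or later, and the fallback to "?" for codons missing from the hash is an assumption added here (the one-liner above would insert an undefined value and warn instead).

```perl
use strict;
use warnings;
use 5.014;   # the /r modifier needs Perl 5.14 or later

my $rna = "aaaaugcccgggugaccccccccc";
my %aminoAcidMap = ( ggg => "G", ccc => "C" );

# Step 1: capture the whole codons between the start and stop codons.
$rna =~ /aug((?:[acgu]{3})*?)uga/ or die "no aug...uga region found\n";
my $pattern = $1;

# Step 2: map each codon to its letter; unknown codons become "?"
# (an assumption for this sketch, not part of the original one-liner).
my $aminoAcids = $pattern =~ s{([acgu]{3})}{ $aminoAcidMap{$1} // "?" }ger;

print "$pattern -> $aminoAcids\n";   # cccggg -> CG
```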

If you pack everything into a sub, you get:

sub getAminoAcids {
    local ($_) = @_;
    /aug((?:[acgu]{3})*?)uga/ or return "";
    $1 =~ s/[acgu]{3}/$aminoAcidMap{$&}/ger;
}

Upvotes: 1
