Reputation:
I have a gene sequence (see below), and I want to find all open reading frames (starting with ATG and ending with TAG).
I have tried this:
my $file = ('ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG');
while ($file =~ /(ATG\w+?TAG)/g) {
    print $1;
}
But it only gives:
ATGGCCCATGATGCATCGACTAGATGTCGATCGATACAGCTAATAG
How can I get every one?
Upvotes: 1
Views: 608
Reputation: 64
If you want the start and stop codons to be in the same reading frame, keep only the matches whose length is a multiple of 3 (here $1 holds everything after the ATG, up to and including the stop codon):
print "ATG$1\n" if length($1) % 3 == 0;
To cover all six reading frames of a sequence, also search the reverse-complement strand:
my $comp_chain = reverse($chain);
$comp_chain =~ tr/ATCG/TAGC/;
Running the same search on both $chain and $comp_chain then yields the open reading frames from all six reading frames of the sequence.
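As a rough sketch of how the two ideas could be combined: the helper names reverse_complement and in_frame_orfs are made up for illustration. Note that this is a variation on the length filter, not the filter itself. Instead of discarding matches afterwards, the pattern steps through whole codons, so each ATG is paired with the first TAG in its own frame. It assumes an uppercase sequence containing only A, C, G and T, and only treats TAG as a stop codon, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse complement of a DNA string.
sub reverse_complement {
    my ($chain) = @_;
    my $comp = scalar reverse $chain;   # force scalar context: reverse the string
    $comp =~ tr/ATCG/TAGC/;             # swap each base for its complement
    return $comp;
}

# Hypothetical helper: every ATG paired with the first in-frame TAG after it.
sub in_frame_orfs {
    my ($chain) = @_;
    my @orfs;
    # (?:[ACGT]{3})*? advances one codon at a time, so the TAG that ends
    # the match is always in the same frame as the ATG that starts it.
    while ($chain =~ /ATG(?=((?:[ACGT]{3})*?TAG))/g) {
        push @orfs, "ATG$1";
    }
    return @orfs;
}

my $chain = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';

# Forward strand first, then the reverse complement: six frames in total.
for my $strand ($chain, reverse_complement($chain)) {
    print "$_\n" for in_frame_orfs($strand);
}
```

Stepping in codons matters when the nearest TAG is out of frame: a plain length check on the shortest match would drop that start codon entirely, even if an in-frame TAG follows later.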
Upvotes: 0
Reputation: 57590
You are getting two matches. To see them, I suggest you print some separator between them:
print "$1\n";
Then we get the output:
ATGGCCCATGATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
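If you also want to find frames that occur inside another frame, you must make sure the match does not consume too many characters. Work around that with a lookahead, using a lazy quantifier so that each match stops at the first TAG (a possessive *+ would swallow the rest of the string and never match):
/ATG(?=([ACTG]*?TAG))/g;
Then print "ATG$1\n". Output: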
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
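For reference, a minimal runnable sketch of this lookahead approach might look like the following. The sub name find_overlapping_orfs is only illustrative, and the sketch assumes a lazy quantifier (*?) so that each candidate ends at the nearest TAG.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every ATG...TAG candidate, including
# candidates that start inside an earlier one.
sub find_overlapping_orfs {
    my ($seq) = @_;
    my @orfs;
    # Only "ATG" is consumed by the match. The lookahead captures the
    # rest without advancing the match position, so a start codon
    # nested inside a previous frame is still found on the next pass.
    while ($seq =~ /ATG(?=([ACGT]*?TAG))/g) {
        push @orfs, "ATG$1";
    }
    return @orfs;
}

my $seq = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';
print "$_\n" for find_overlapping_orfs($seq);
```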
Upvotes: 2
Reputation: 14921
The trick to finding all occurrences is to use a zero-width assertion, which prevents the regex from "eating" our characters: (?=ATG\w+?TAG).
The problem with this is that we'll get empty matches, so the solution is to wrap the pattern in a capturing group: (?=(ATG\w+?TAG)). You will find all occurrences in group 1.
Group 1 output:
ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
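In Perl, this pattern could be used roughly as follows. The sketch also records where each match starts, which is often useful for telling nested frames apart. The sub name orf_positions is made up for this example; $-[1] is Perl's built-in 0-based start offset of capture group 1.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, sequence] pairs for every
# ATG...TAG candidate, including overlapping ones.
sub orf_positions {
    my ($seq) = @_;
    my @hits;
    # The whole pattern is zero-width, so nothing is consumed and the
    # regex engine retries from every position in the string.
    while ($seq =~ /(?=(ATG\w+?TAG))/g) {
        push @hits, [ $-[1], $1 ];   # start offset of group 1, and its text
    }
    return @hits;
}

my $seq = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';
printf "%d\t%s\n", @$_ for orf_positions($seq);
```

Because \w+? requires at least one character, an ATG immediately followed by TAG is not reported by this pattern.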
Upvotes: 3