user2729360

Reputation:

Find many matches in a nucleotide sequence with a regex

I have a gene sequence (see below), and I want to find all open reading frames (each starts with ATG and ends with TAG).

I have tried this:

my $file = ('ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG');

while ($file =~ /(ATG\w+?TAG)/g) {
    print $1;
}

But it only gives:

ATGGCCCATGATGCATCGACTAGATGTCGATCGATACAGCTAATAG

How can I get every one?

Upvotes: 1

Views: 608

Answers (4)

Jose

Reputation: 64

If you want the start and stop codons to be in the same frame, remember to keep only the results whose length is a multiple of 3:

# $1 holds everything after ATG, so its length is a multiple of 3 exactly when the whole match's is
print "ATG$1\n" if length($1) % 3 == 0;
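One caveat: filtering shortest matches by length can drop a start codon whose nearest TAG is out of frame even though a later TAG is in frame. A minimal sketch of one alternative is to step through the sequence three bases at a time inside the lookahead. This assumes, as in the question, that only TAG counts as a stop codon (TAA and TGA are ignored here), and the helper name in_frame_orfs is made up for this example:

```perl
use strict;
use warnings;

# Sequence from the question.
my $seq = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';

# in_frame_orfs is a hypothetical helper name used only for this sketch.
sub in_frame_orfs {
    my ($dna) = @_;
    my @orfs;
    # (?:[ACGT]{3})*? advances one codon at a time, so the TAG that ends
    # the capture is always in the same frame as the ATG that starts it.
    while ($dna =~ /ATG(?=((?:[ACGT]{3})*?TAG))/g) {
        push @orfs, "ATG$1";
    }
    return @orfs;
}

print "$_\n" for in_frame_orfs($seq);
```

Every string this returns has a length that is a multiple of 3, so no separate length check is needed afterwards.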

If you want to check all six reading frames of a sequence, remember to check the complementary strand as well:

my $comp_chain = reverse($chain);   # scalar assignment, so the string itself is reversed
$comp_chain =~ tr/ATCG/TAGC/;       # replace each base with its complement

Searching both $chain and $comp_chain then covers all six reading frames of the sequence.
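Put together, scanning both strands could look like the sketch below. The helper names reverse_complement and in_frame_orfs are hypothetical, the sequence is the one from the question, and only TAG is treated as a stop codon, as the question assumes. The frame is reported as the 0-based offset of the ATG modulo 3:

```perl
use strict;
use warnings;

my $chain = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';

# Hypothetical helper: reverse the strand, then complement each base.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;   # scalar context reverses the string itself
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

# Hypothetical helper: every ATG followed by an in-frame TAG,
# together with the frame (0, 1 or 2) in which the ATG sits.
sub in_frame_orfs {
    my ($dna) = @_;
    my @orfs;
    while ($dna =~ /ATG(?=((?:[ACGT]{3})*?TAG))/g) {
        my $start = $-[0];          # 0-based offset where this ATG begins
        push @orfs, { frame => $start % 3, seq => "ATG$1" };
    }
    return @orfs;
}

for my $strand (['+', $chain], ['-', reverse_complement($chain)]) {
    my ($label, $dna) = @$strand;
    printf "%s frame %d: %s\n", $label, $_->{frame}, $_->{seq}
        for in_frame_orfs($dna);
}
```

Because each ATG fixes its own frame, one pass per strand covers the three frames of that strand.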

Upvotes: 0

amon

Reputation: 57590

You are getting two matches. To see them, I suggest you print some separator between them:

print "$1\n";

Then we get the output:

ATGGCCCATGATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG

If you also want to find frames that lie inside another frame, you must make sure not to consume too many characters. Work around that with a lookahead:

/ATG(?=([ACTG]*?TAG))/g;

Then print "ATG$1\n". Output:

ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
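As a complete script, the lookahead approach might look like this minimal sketch. It uses a lazy quantifier so that each capture stops at the first TAG after its ATG; find_orfs is a hypothetical helper name, not something from the question:

```perl
use strict;
use warnings;

# Sequence from the question.
my $seq = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';

# find_orfs is a hypothetical helper name used only for this sketch.
sub find_orfs {
    my ($dna) = @_;
    my @found;
    # Only ATG is consumed; the lookahead captures the rest of the
    # candidate without moving the match position, so the next search
    # resumes right after the current ATG and nested frames are seen too.
    while ($dna =~ /ATG(?=([ACGT]*?TAG))/g) {
        push @found, "ATG$1";
    }
    return @found;
}

print "$_\n" for find_orfs($seq);
```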

Upvotes: 2

HamZa

Reputation: 14921

The trick to finding all occurrences is to use a zero-width assertion, which prevents the characters from being consumed: (?=ATG\w+?TAG).

The problem with this is that we get empty matches, so the solution is to wrap the pattern in a capturing group:
(?=(ATG\w+?TAG)). You will find all occurrences in group 1.

Group 1 output:

ATGGCCCATGATGCATCGACTAG
ATGATGCATCGACTAG
ATGCATCGACTAG
ATGTCGATCGATACAGCTAATAG
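In Perl, one way to collect group 1 is to run the match in list context, where a /g match returns the captures of every match. A short sketch using the sequence from the question (the variable names are only illustrative):

```perl
use strict;
use warnings;

# Sequence from the question.
my $seq = 'ACCCTGCCCAAAATCCCCCCGATCGATAGAGCTAAATGGCCCATGATGCATCGACTAGCTAGCTAAAATGTCGATCGATACAGCTAATAG';

# The match itself is zero-width, so nothing is consumed; in list
# context /g returns group 1 of every successful match position.
my @orfs = $seq =~ /(?=(ATG\w+?TAG))/g;

print "$_\n" for @orfs;
```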


Upvotes: 3

Birei

Reputation: 36252

The result is correct; simply separate the matches in the output:

print "$1\n";

Upvotes: 2
