Steve

Reputation: 1097

How can I match numbers in both scientific notation and decimal form with a Perl regex?

I'm new to Perl regex, so I appreciate any help. I am parsing BLAST outputs. Right now my regex only handles hits whose e-value is a plain integer or decimal (e.g. 0.002). How can I also capture hits whose e-value is in scientific notation (e.g. 4e-15)?

blastoutput.txt

                                                               Score     E
Sequences producing significant alignments:                       (Bits)  Value

ref|WP_001577367.1|  hypothetical protein [Escherichia coli] >...  75.9    4e-15
ref|WP_001533923.1|  cytotoxic necrotizing factor 1 [Escherich...  75.9    7e-15
ref|WP_001682680.1|  cytotoxic necrotizing factor 1 [Escherich...  75.9    7e-15
ref|ZP_15044188.1|  cytotoxic necrotizing factor 1 domain prot...  40.0    0.002
ref|YP_650655.1|  hypothetical protein YPA_0742 [Yersinia pest...  40.0    0.002

ALIGNMENTS
>ref|WP_001577367.1| hypothetical protein [Escherichia coli]

parse.pl

open (FILE, './blastoutput.txt');
my $marker = 0;
my @one;
my @acc;
my @desc;
my @score;
my @evalue;
my $counter=0;
while(<FILE>){
    chomp;
    if($marker==1){
        if(/^(\D+)\|(.+?)\|\s(.*?)\s(\d+)(\.\d+)? +(\d+)([\.\d+]?) *$/) {
        #if(/^(\D+)\|(.+?)\|\s(.*?)\s(\d+)(\.\d+)? +(\d+)((\.\d+)?(e.*?)?) *$/)
            $one[$counter] = $1;
            $acc[$counter] = $2;
            $desc[$counter] = $3;
            $score[$counter] = $4+$5;
            if(! $7){
                $evalue[$counter] = $6;
            }else{
                $evalue[$counter] = $6+$7;
            }
            $counter++;
        }
    }
    if(/Sequences producing significant alignments/){
        $marker = 1;
    }elsif(/ALIGNMENTS/){
        $marker = 0;
    }elsif(/No significant similarity found/){
        last;
    }
}
for(my $i=0; $i < scalar(@one); $i++){
    print "$one[$i] | $acc[$i] | $desc[$i] | $score[$i] | $evalue[$i]\n";
}
close FILE;

Upvotes: 0

Views: 860

Answers (3)

Casimir et Hippolyte

Reputation: 89557

You can match a number in scientific notation (or not) with this:

\d+(?:\.\d+)?+(?:e[+-]?\d+)?+

With your code:

if (/^([^|]+)\|([^|]+)\|\s++(.*?)\s(\d+(?:\.\d+)?+)\s+(\d+(?:\.\d+)?+(?:e[+-]?\d+)?+)\s*$/) {
    $one[$counter] = $1;
    $acc[$counter] = $2;
    $desc[$counter] = $3;
    $score[$counter] = $4;
    $evalue[$counter] = $5;
    $counter++;
}

(I have added some possessive quantifiers, ++ and ?+, to reduce the number of backtracking steps as much as possible, but the 3rd group still uses a lazy quantifier. It would be best to use a more precise pattern for the description part, if possible.)
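
As a quick check of the number pattern on its own, here is a minimal sketch that runs it against the values from the sample report plus a few edge cases. The `$num` variable, the anchors, the added `/i` flag (so that an uppercase `E` would also match) and the test strings are my assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Same pattern as above, anchored so the whole string must be a number.
# The /i flag is an addition so that "1.5E+3" is accepted as well.
my $num = qr/^\d+(?:\.\d+)?+(?:e[+-]?\d+)?+$/i;

for my $s ('4e-15', '7e-15', '0.002', '75.9', '40', '1.5E+3', 'e-15', '4e') {
    printf "%-8s %s\n", $s, ($s =~ $num ? 'matches' : 'no match');
}

# Perl converts such strings to numbers on its own, so a captured
# e-value can be compared directly against a threshold.
print "significant\n" if '4e-15' < 1e-10;
```

Only the last two strings, `e-15` and `4e`, should be rejected, because the pattern requires digits before the exponent and after the `e`.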

Upvotes: 3

SES

Reputation: 870

If this is an assignment or practice with Perl, then take some of the other suggestions and try to figure out the best solution (but don't stop there: you'll also find a lot on the internet, and there are even books that cover the topic of parsing BLAST!). In practice, though, you would not want to parse a BLAST report this way: the code will be hard to read, and it is not guaranteed to keep working, because the plain-text report format may change.

I highly recommend you stick to the XML output, or the tab-delimited table formats, and just use BioPerl's Bio::SearchIO to parse your reports. For example, if you take a look at the Bio::SearchIO HOWTO you can see that it is quite easy to select certain parts of your reports and filter by certain criteria without having any Perl knowledge. If you want to come up with a non-BioPerl solution, I would recommend you consider the tab-delimited format to make things easier on yourself in the future (then you can implement the complicated tasks in a way that is manageable and readable).
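
To illustrate why the tab-delimited format is easier to handle, here is a minimal non-BioPerl sketch. It assumes the report was produced with BLAST+ `-outfmt 6` and its default twelve columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The `parse_tabular_line` helper is hypothetical, and the two sample rows are made up for illustration (only the accessions, bit scores and e-values are borrowed from the question).

```perl
use strict;
use warnings;

# Default column order of BLAST+ "-outfmt 6" (an assumption about the input).
my @cols = qw(qseqid sseqid pident length mismatch gapopen
              qstart qend sstart send evalue bitscore);

# Hypothetical helper: turn one tab-delimited line into a hash reference.
sub parse_tabular_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line;
    return unless @f == @cols;      # skip comments and malformed lines
    my %hit;
    @hit{@cols} = @f;               # hash slice: column name => value
    return \%hit;
}

# Made-up rows for illustration only.
my @lines = (
    join("\t", qw(query1 ref|WP_001577367.1| 45.2 93 48 2 5 96 10 100 4e-15 75.9)),
    join("\t", qw(query1 ref|ZP_15044188.1| 30.1 83 52 3 7 85 20 100 0.002 40.0)),
);

for my $line (@lines) {
    my $hit = parse_tabular_line($line) or next;
    # No regex for the number is needed: Perl numifies "4e-15" itself.
    next unless $hit->{evalue} < 1e-5;
    print "$hit->{sseqid}\t$hit->{bitscore}\t$hit->{evalue}\n";
}
```

Because every field has a fixed position, the e-value never has to be recognized by its shape, so scientific notation stops being a special case.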

Upvotes: 0

perreal

Reputation: 97948

You could also avoid matching those numbers with a regex at all, and split each line instead:

while(<FILE>){
    chomp;
    $marker = 0 if $marker and /ALIGNMENTS/;
    if($marker == 1 and my ($r, $w, $d) = split(/[|]/)) {
            my @v = split (/\s+/, $d);
            print "$v[-2]\t$v[-1]\n";
            # some processing ...
    }   
    $marker = 1 if /Sequences producing significant alignments/;
    last        if /No significant similarity found/;
}
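
For reference, here is a self-contained variant of the same idea, as a sketch that reads an abbreviated copy of the sample report from an in-memory string instead of `FILE`. The `$report` text, the `@hits` array, the split limit of 3 and the check that a line really has a third `|`-separated part are my additions.

```perl
use strict;
use warnings;

# Abbreviated copy of the sample report from the question.
my $report = <<'END';
Sequences producing significant alignments:                       (Bits)  Value

ref|WP_001577367.1|  hypothetical protein [Escherichia coli] >...  75.9    4e-15
ref|ZP_15044188.1|  cytotoxic necrotizing factor 1 domain prot...  40.0    0.002

ALIGNMENTS
>ref|WP_001577367.1| hypothetical protein [Escherichia coli]
END

open my $fh, '<', \$report or die "Cannot open in-memory report: $!";

my ($marker, @hits) = (0);
while (my $line = <$fh>) {
    chomp $line;
    $marker = 0 if $marker and $line =~ /ALIGNMENTS/;
    if ($marker) {
        # A limit of 3 keeps any later "|" inside the description part.
        my ($db, $acc, $rest) = split /[|]/, $line, 3;
        if (defined $rest) {
            my @v = split ' ', $rest;
            # The last two fields are the bit score and the e-value.
            push @hits, { acc => $acc, score => $v[-2], evalue => $v[-1] };
        }
    }
    $marker = 1 if $line =~ /Sequences producing significant alignments/;
    last        if $line =~ /No significant similarity found/;
}
close $fh;

print "$_->{acc}\t$_->{score}\t$_->{evalue}\n" for @hits;
```

The e-value is taken by position (the last whitespace-separated field), so `4e-15` and `0.002` are handled the same way.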

Upvotes: 0
