malcolm
malcolm

Reputation: 33

perl do something once in a while loop

I am frequently working with biological sequence data (FASTAs) that have the following format, where a leading left angle bracket is used as a delimiter to indicate a new sequence header. These files often have text wrapping (except for the headers):

>header1
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>header2
CTGGGACTAGGGGGAG
CTGGGACTAGGGGGAG

Often, I want to avoid reading the entire file into memory because it can be many MBs (sometimes GBs), so I try to focus on while loops and reading line by line. However, I find myself often needing to add extra code to do something unique at the top or bottom of the file. For instance, today I wanted to remove the text wrapping of some file, which seemed as simple as:

while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        print $outputfasta_fh "$line\n";
    }
    else {
        print $outputfasta_fh $line;
    }
}

But, I realized I need a newline before all the headers except the first one (otherwise they would be concatenated to end of the the previous sequence). So, this is my crude work-around.

my $switch = 0;
while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        if ($switch == 1) {
            print $outputfasta_fh "\n";
        }
        print $outputfasta_fh "$line\n";
        $switch = 1;
    }
    else {
        print $outputfasta_fh $line;
    }
}

Previously, I've had other issues where I need to do something with the last line. For example, I had a script that would read a fasta, store each header and then start counting its sequence length (again line-by-line), and if it was within a range I specified, I saved it to another file. The counting would abort if the length went past the maximum, but I wouldn't know if it was more than the minimum until I reached another header or the end of the file. In the latter case, I had to repeat the length-checking subroutine below the while loop. I would like to avoid repeating that last part.

my $length = 0;
my $header;
my @line_array;

while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        # check if previous sequence had a length within range
        if (check_length($length, $minlength, $maxlength) == 1) {
            print $outputfasta_fh "$header\n";
            print $outputfasta_fh join ("\n", @line_array), "\n";
        }
        undef @line_array;
        $header = $line;
        $length = 0;
    }
    else {
        if ($length <= $maxlength) { # no point in measuring any more
            push (@linearray, $line);
            $length += length($line);
        }
    }
}

#and now for the last sequence
if (check_length($length, $minlength, $maxlength) == 1) {
    print $outputfasta_fh "$header\n";
    print $outputfasta_fh join ("\n", @line_array), "\n";
}

sub check_length {
    my ($length, $minlength, $maxlength) = @_;
    if (($length >= $minlength) && ($length <= $maxlength)) {
        return 1;
    }
    else {
        return 0;
    }
}

So, my basic question is how to indicate I want to do something once in a loop without resorting to counters or repeating code outside the loop? Thanks for any help!

Upvotes: 1

Views: 1489

Answers (2)

Chris Charley
Chris Charley

Reputation: 6573

Here are solutions to the 2 problems you described. They are solved using modules from the BioPerl distribution. In this case the Bio::SeqIO module to open files and the Bio::Seq module for some methods it provides, (length, width). You can see how they simplify the solutions!

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "input1.txt" ,
                           -format => 'fasta');
my $out = Bio::SeqIO->new( -file   => '>test.dat',
                           -format => 'fasta');

while ( my $seq = $in->next_seq() ) {
    $out->width($seq->length); # sequence on 1 line.
    $out->write_seq($seq);
}

my ($minlen, $maxlen) = (40, 1000);

while ( my $seq = $in->next_seq() ){
    my $len = $seq->length;
    out->write_seq($seq) if $minlen <= $len && $len <= $maxlen;
}

It would be worth your while to look into the modules - as you can see from these 2 examples, the resultant code is much more concise and easier to follow. You could look around the BioPerl wiki. The HOWTOs give some examples that you can use right away.

Upvotes: 4

ilomambo
ilomambo

Reputation: 8350

It is not clear what exactly you want to achieve.
But if you know for sure the special cases are the first line or the last line you have several ways to deal with it:

Special first line that does not need regular processing

Process first line
$line = <$INPUT>;
... process line

Regular processing
while(<$INPUT>) {
... process lines
}

Special first line which also need regular processing

Process first line
$line = <$INPUT>;
... process line

Regular processing
do {
... process lines
} while(<$INPUT>);

Special last line,

here you don't have a way to identify the last line beforehand so you have to do it in the loop (unless you know exactly how many lines are there and use a for loop for the first N-1 an then process the last line separately)

while(<$INPUT>) {
   break if islastline();
   ... process lines
}
... process last line

or

while(<$INPUT>) {
   ... process lines
   break if islastline();
}
... process last line

or

for($i=0; $i<N-1 ; $i++) {
   $line = <$INPUT>;
   ...process lines
}
$line = <$INPUT>
... process last line

The other situation you describe, where you need to count and after you are done, the loop continues but you don't need to count anymore is different. If you are concerned about the code looking "clean" of counting just split the loop in two:

Inner temporary processing

first part does the whole package
while(<$INPUT>) {
   ...regular processing
   ...special processing
   break if specialProcessingDone();
}

second part does not need to do special processing anymore
while(<$INPUT>) {
   ...regular processing
}

Upvotes: 1

Related Questions