perl do something once in a while loop

Question

I am frequently working with biological sequence data (FASTAs) that have the following format, where a leading left angle bracket is used as a delimiter to indicate a new sequence header. These files often have text wrapping (except for the headers):

>header1
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>header2
CTGGGACTAGGGGGAG
CTGGGACTAGGGGGAG

Often, I want to avoid reading the entire file into memory because it can be many MBs (sometimes GBs), so I try to focus on while loops and reading line by line. However, I find myself often needing to add extra code to do something unique at the top or bottom of the file. For instance, today I wanted to remove the text wrapping of some file, which seemed as simple as:

while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        print $outputfasta_fh "$line
";
    }
    else {
        print $outputfasta_fh $line;
    }
}

But, I realized I need a newline before all the headers except the first one (otherwise they would be concatenated to end of the the previous sequence). So, this is my crude work-around.

my $switch = 0;
while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        if ($switch == 1) {
            print $outputfasta_fh "
";
        }
        print $outputfasta_fh "$line
";
        $switch = 1;
    }
    else {
        print $outputfasta_fh $line;
    }
}

Previously, I've had other issues where I need to do something with the last line. For example, I had a script that would read a fasta, store each header and then start counting its sequence length (again line-by-line), and if it was within a range I specified, I saved it to another file. The counting would abort if the length went past the maximum, but I wouldn't know if it was more than the minimum until I reached another header or the end of the file. In the latter case, I had to repeat the length-checking subroutine below the while loop. I would like to avoid repeating that last part.

my $length = 0;
my $header;
my @line_array;

while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        # check if previous sequence had a length within range
        if (check_length($length, $minlength, $maxlength) == 1) {
            print $outputfasta_fh "$header
";
            print $outputfasta_fh join ("
", @line_array), "
";
        }
        undef @line_array;
        $header = $line;
        $length = 0;
    }
    else {
        if ($length <= $maxlength) { # no point in measuring any more
            push (@linearray, $line);
            $length += length($line);
        }
    }
}

#and now for the last sequence
if (check_length($length, $minlength, $maxlength) == 1) {
    print $outputfasta_fh "$header
";
    print $outputfasta_fh join ("
", @line_array), "
";
}

sub check_length {
    my ($length, $minlength, $maxlength) = @_;
    if (($length >= $minlength) && ($length <= $maxlength)) {
        return 1;
    }
    else {
        return 0;
    }
}

So, my basic question is how to indicate I want to do something once in a loop without resorting to counters or repeating code outside the loop? Thanks for any help!

Chris Charley · Accepted Answer

Here are solutions to the 2 problems you described. They are solved using modules from the BioPerl distribution. In this case the Bio::SeqIO module to open files and the Bio::Seq module for some methods it provides, (length, width). You can see how they simplify the solutions!

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "input1.txt" ,
                           -format => 'fasta');
my $out = Bio::SeqIO->new( -file   => '>test.dat',
                           -format => 'fasta');

while ( my $seq = $in->next_seq() ) {
    $out->width($seq->length); # sequence on 1 line.
    $out->write_seq($seq);
}

my ($minlen, $maxlen) = (40, 1000);

while ( my $seq = $in->next_seq() ){
    my $len = $seq->length;
    out->write_seq($seq) if $minlen <= $len && $len <= $maxlen;
}

It would be worth your while to look into the modules - as you can see from these 2 examples, the resultant code is much more concise and easier to follow. You could look around the BioPerl wiki. The HOWTOs give some examples that you can use right away.

perl do something once in a while loop

Answers (2)

Special first line that does not need regular processing

Special first line which also need regular processing

Special last line,

Inner temporary processing

Related Questions