Lucia
Lucia

Reputation: 911

Calculate the length of a string in a specific file format with perl

I am trying to both learn perl and use it in my research. I need to do a simple task which is counting the number of sequences and their lengths in a file such as follow:

>sequence1
ATCGATCGATCG
>sequence2
AAAATTTT
>sequence3
CCCCGGGG  

The output should look like this:

sequence1 12
sequence2 8
sequence3 8
Total number of sequences = 3

This is the code I have written which is very crude and simple:

#!/usr/bin/perl

use strict;
use warnings;

my ($input, $output) = @ARGV; 

open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.

while (<INFILE>) {
    chomp;
    if (/^>/)
    {
        my $number_of_sequences++;
    }else{
        my length = length ($input);
    }
}
print length, number_of_sequences;

close (INFILE);

I'd be grateful if you could give me some hints, for example, in the else block, when I use the length function, I am not sure what argument I should pass into it.

Thanks in advance

Upvotes: 2

Views: 969

Answers (4)

equihua
equihua

Reputation: 1

#!/usr/bin/perl
use strict;
use warnings;

my ($file, $line, $length, $tag, $count);


$file = $ARGV[0];

open (FILE, "$file") or print"can't open file $file\n";
while (<FILE>){
    $line=$_;
    chomp $line;

    if ($line=~/^>/){
        $tag = $line;
    }
    else{
        $length = length ($line);
        $count=1;
    }
    if ($count==1){
        print "$tag\t$length\n";
        $count=0
    }
}

close FILE;

Upvotes: 0

TLP
TLP

Reputation: 67900

Since you have indicated that you want feedback on your program, here goes:

my ($input, $output) = @ARGV; 

open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.
open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.

Personally, I think when dealing with a simple input/output file relation, it is best to just use the diamond operator and standard output. That means that you read from the special file handle <>, commonly referred to as "the diamond operator", and you print to STDOUT, which is the default output. If you want to save the output in a file, just use shell redirection:

perl program.pl input.txt > output.txt

In this part:

    my $number_of_sequences++;

you are creating a new variable. This variable will go out of scope as soon as you leave the block { .... }, in this case: the if-block.

In this part:

    my length = length ($input);

you forgot the $ sigil. You are also using length on the file name, not the line you read. If you want to read a line from your input, you must use the file handle:

my $length = length(<INFILE>);

Although this will also include the newline in the length.

Here you have forgotten the sigils again:

print length, number_of_sequences;

And of course, this will not create the expected output. It will print something like sequence112.

Recommendations:

  • Use a while (<>) loop to read your input. This is the idiomatic method to use.
  • You do not need to keep a count of your input lines, there is a line count variable: $.. Though keep in mind that it will also count "bad" lines, like blank lines or headers. Using your own variable will allow you to account for such things.
  • Remember to chomp the line before finding out its length. Or use an alternative method that only counts the characters you want: my $length = ( <> =~ tr/ATCG// ) This will read a line, count the letters ATGC, return the count and discard the read line.

Summary:

use strict;
use warnings;   # always use these two pragmas

my $count;
while (<>) {
    next unless /^>/;  # ignore non-header lines
    $count++;          # increment counter
    chomp;    
    my $length = (<> =~ tr/ATCG//);   # get length of next line
    s/^>(\S+)/$1 $length\n/;              # remove > and insert length
} continue {
        print;             # print to STDOUT
    }
print "Total number is sequences = $count\n";

Note the use of continue here, which will allow us to skip a line that we do not want to process, but that will still get printed.

And as I said above, you can redirect this to a file if you want.

Upvotes: 1

ErikR
ErikR

Reputation: 52029

For starters, you need to change your inner loop to this:

...
  chomp;
  if (/^>/)
  {
      $number_of_sequences++;
      $sequence_name = $_;
  }else{
      print "$sequence_name ", length($input), "\n";
  }
...

Note the following:

  • The my declaration has been removed from $number_of_sequences
  • The sequence name is captured in the variable $sequence_name. It is used later when the next line is read.

To make the script run under strict mode, you can add my declarations for $number_of_sequences and $sequence_name outside of the loop:

my $sequence_name;
my $number_of_sequences = 0;
while (<INFILE>) {
 ...(as above)...
}
print "Total number of sequences: $number_of_sequences\n";

The my keyword declares a new lexically scoped variable - i.e. a variable which only exists within a certain block of code, and every time that block of code is entered, a new version of that variable is created. Since you want to have the value of $sequence_name carry over from one loop iteration to the next you need to place the my outside of the loop.

Upvotes: 0

Paul Roub
Paul Roub

Reputation: 36438

You're printing out just the last length, not each sequence length, and you want to catch the sequence names as you go:

#!/usr/bin/perl

use strict;
use warnings;

my ($input, $output) = @ARGV; 
my ($lastSeq, $number_of_sequences) = ('', 0);

open(INFILE, '<', $input) or die "Can't open $input, $!\n"; # Open a file for reading.

# You never use OUTFILE
# open(OUTFILE, '>', $output) or die "Can't open $output, $!"; # Open a file for writing.

while (<INFILE>) {
    chomp;
    if (/^>(.+)/)
    {
        $lastSeq = $1;
        $number_of_sequences++;
    } 
    else
    {
        my $length = length($_);
        print "$lastSeq $length\n";
    }
}

print "Total number of sequences = $number_of_sequences\n";

close (INFILE);

Upvotes: 1

Related Questions