zebra
zebra

Reputation: 83

Counting the frequency of bases using while loop and substr with Perl

I am trying to write in Perl to count the number of each A/C/G/T bases in a DNA sequence. But couldn't figure out what went wrong in my code. "ATCTAGCTAGCTAGCTA" is the kind of data I am given.

#!usr/bin/perl
use strict;
use warnings;

my $in_file = <$ARGV[0]>;
open( my $FH_IN, "<", $in_file );

my $dna   = <$FH_IN>;
my $index = 0;
my ( $freq_a, $freq_c, $freq_g, $freq_t ) = 0;

my $dna_length = length($dna);
while ( $index < $dna_length ) {
    my $base = substr( $dna, $index, 1 );
    if ( $base eq "A" ) {
        $freq_a++;
        $index++;
        next;
    } elsif ( $base eq "C" ) {
        $freq_c++;
        $index++;
        next;
    } elsif ( $base eq "G" ) {
        $freq_g++;
        $index++;
        next;
    } elsif ( $base eq "T" ) {
        $freq_t++;
        $index++;
        next;
    } else {
        next;
    }
}
print "$freq_a\n$freq_c\n$freq_g\n$freq_t\n";

exit;

I know there are a lot of ways to do it, but what I want to know is what I did wrong so I can learn from mistakes.

Upvotes: 0

Views: 739

Answers (3)

Borodin
Borodin

Reputation: 126742

Okay, you've done well so far and there's only one problem that stops your program from working.

It's far from obvious, but each line that's read from the file has a newline character "\n" at the end. So what's happening is that $index reaches the newline in the string, which is processed by the else clause (because it's not A, C, G or T) which just does a next, so the same character is processed over and over again. Your program just hangs, right?

You could remove the newline with chomp, but a proper fix is to increment $index in the else clause just as you do with all the other characters. So it looks like

else {
   ++$index;
   next;
}

As you've suspected, there are much better ways to write this. There are also a couple of other nasties in your code, but that change should get you on your way for now.

Upvotes: 1

TLP
TLP

Reputation: 67900

Perl has a special file handle to use with these kinds of problems: The diamond operator <>. It will read input from either a file name, if provided, and standard input if not.

Secondly, since you are only interested in ACGT, might as well look for only them, using a regex: /([ACGT])/g.

Thirdly, using a hash is the idiomatic way to count characters in Perl: $count{A}++

So your script becomes:

use strict;
use warnings;

my %count;
while (<>) {
    while (/([ACGT])/g) {
        $count{$1}++;
    }
}

print "$_\n" for @count{qw(A C G T)};

Usage:

script.pl input.txt 

Upvotes: 2

Dave Cross
Dave Cross

Reputation: 69304

It would be instructive for you to print out the values in $dna_length, $index and $base each time you go round the loop - immediately after you assign a value to $base.

Your code would be more robust if you moved the incrementing of $index to the end of the loop (outside of the if/elsif/else block) and removed all of your next statements.

An alternative "quick fix" is to chomp() the input line before you start processing it.

Upvotes: 1

Related Questions