Reputation: 83
I am trying to write in Perl to count the number of each A/C/G/T bases in a DNA sequence. But couldn't figure out what went wrong in my code. "ATCTAGCTAGCTAGCTA" is the kind of data I am given.
#!usr/bin/perl
use strict;
use warnings;
my $in_file = <$ARGV[0]>;
open( my $FH_IN, "<", $in_file );
my $dna = <$FH_IN>;
my $index = 0;
my ( $freq_a, $freq_c, $freq_g, $freq_t ) = 0;
my $dna_length = length($dna);
while ( $index < $dna_length ) {
my $base = substr( $dna, $index, 1 );
if ( $base eq "A" ) {
$freq_a++;
$index++;
next;
} elsif ( $base eq "C" ) {
$freq_c++;
$index++;
next;
} elsif ( $base eq "G" ) {
$freq_g++;
$index++;
next;
} elsif ( $base eq "T" ) {
$freq_t++;
$index++;
next;
} else {
next;
}
}
print "$freq_a\n$freq_c\n$freq_g\n$freq_t\n";
exit;
I know there are a lot of ways to do it, but what I want to know is what I did wrong so I can learn from mistakes.
Upvotes: 0
Views: 739
Reputation: 126742
Okay, you've done well so far and there's only one problem that stops your program from working.
It's far from obvious, but each line that's read from the file has a newline character "\n"
at the end. So what's happening is that $index
reaches the newline in the string, which is processed by the else
clause (because it's not A
, C
, G
or T
) which just does a next
, so the same character is processed over and over again. Your program just hangs, right?
You could remove the newline with chomp
, but a proper fix is to increment $index
in the else
clause just as you do with all the other characters. So it looks like
else {
++$index;
next;
}
As you've suspected, there are much better ways to write this. There are also a couple of other nasties in your code, but that change should get you on your way for now.
Upvotes: 1
Reputation: 67900
Perl has a special file handle to use with these kinds of problems: The diamond operator <>
. It will read input from either a file name, if provided, and standard input if not.
Secondly, since you are only interested in ACGT
, might as well look for only them, using a regex: /([ACGT])/g
.
Thirdly, using a hash is the idiomatic way to count characters in Perl: $count{A}++
So your script becomes:
use strict;
use warnings;
my %count;
while (<>) {
while (/([ACGT])/g) {
$count{$1}++;
}
}
print "$_\n" for @count{qw(A C G T)};
Usage:
script.pl input.txt
Upvotes: 2
Reputation: 69304
It would be instructive for you to print out the values in $dna_length
, $index
and $base
each time you go round the loop - immediately after you assign a value to $base
.
Your code would be more robust if you moved the incrementing of $index
to the end of the loop (outside of the if
/elsif
/else
block) and removed all of your next
statements.
An alternative "quick fix" is to chomp()
the input line before you start processing it.
Upvotes: 1