Difficulty in finding error in perl script

Question

Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.

ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC

Now what I want to do is, in a simplistic way,

take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.

Here is the code what I've tried so far.

#!/usr/bin/perl -w

use strict;
use warnings;

my $in = $ARGV[0];
open(IN, $in);
my @short_reads = ;
my $first_read = $short_reads[0];
chomp $first_read;
my @all_end_res;


for(my $i=0; $i<=$#short_reads; $i++){
        chomp $short_reads[$i];
        my $end_of_kmers = substr($short_reads[$i], -6);
        if($short_reads[$i+1] =~ /^$end_of_kmers/){
                my $end_res = substr($short_reads[$i], -1);
                push(@all_end_res, $end_res);
        }

}

my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"

";

At the end I should get an output like ATGGAAGTCGCGGAATC but I'm getting ATGGAAGGTCGCGGAAT. The error must be in if, any help is greatly appreciated.

Hynek -Pichi- Vychodil · Accepted Answer

There are three huge problems in IT.

Naming of things.
Off by one error.

And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.

#!/usr/bin/env perl
use strict;
use warnings;

use constant LENGTH => 6;

my $read = <>;
chomp $read;

while (<>) {
    chomp;
    last unless length > LENGTH;
    if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
        $read .= substr( $_, LENGTH );
    }
    else {last}
}

print $read, "
";

I didn't get this ARGV[0] thing. It is useless and inflexible.

$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data 
ATGGAAGTCGCGGAATC

But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict

last unless length == LENGTH + 1;

Edit:

If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < @a; $i++) is more idiomatic.)

my @short_reads = <>;
chomp @short_reads;

my @all_end_res;
for my $i (1 .. $#short_reads) {
    my $prev_read = $short_reads[$i-1];
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($prev_read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        push @all_end_res, $1;
    }
}

print $short_reads[0], join('', @all_end_res), "
";

The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.

my @short_reads = <>;
chomp @short_reads;

my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
    my $prev_read = $short_reads[$i-1];
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($prev_read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read
";

Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.

my @short_reads = <>;
chomp @short_reads;

my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
    my $curr_read = $short_reads[$i+1];
    my $end_of_kmers = substr($read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read
";

Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.

my @short_reads = <>;
chomp @short_reads;

my $read = shift @short_reads;
for my $curr_read (@short_reads) {
    my $end_of_kmers = substr($read, -6);
    if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
        $read .= $1;
    }
}

print "$read
";

And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.

Edit2:

It is almost ten times faster using substr and eq then using regular expressions.

$ time ./code.pl data.out > data.test  

real    0m0.480s
user    0m0.468s
sys     0m0.008s
$ time ./code2.pl data.out > data2.test

real    0m4.520s
user    0m4.516s
sys     0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test

Difficulty in finding error in perl script

Answers (2)

Related Questions