Reputation: 33
Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
Now what I want to do is, in a simplistic way,
take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.
Here is the code what I've tried so far.
#!/usr/bin/perl -w
use strict;
use warnings;
my $in = $ARGV[0];
open(IN, $in);
my @short_reads = <IN>;
my $first_read = $short_reads[0];
chomp $first_read;
my @all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
if($short_reads[$i+1] =~ /^$end_of_kmers/){
my $end_res = substr($short_reads[$i], -1);
push(@all_end_res, $end_res);
}
}
my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"\n\n";
At the end I should get an output like ATGGAAGTCGCGGAATC
but I'm getting ATGGAAGGTCGCGGAAT
. The error must be in if
, any help is greatly appreciated.
Upvotes: 2
Views: 106
Reputation: 26121
There are three huge problems in IT.
And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.
#!/usr/bin/env perl
use strict;
use warnings;
use constant LENGTH => 6;
my $read = <>;
chomp $read;
while (<>) {
chomp;
last unless length > LENGTH;
if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
$read .= substr( $_, LENGTH );
}
else {last}
}
print $read, "\n";
I didn't get this ARGV[0]
thing. It is useless and inflexible.
$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data
ATGGAAGTCGCGGAATC
But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict
last unless length == LENGTH + 1;
Edit:
If you like working with an array you should try avoid using for(;;)
. It is prone to errors. (BTW for (my $i = 0; $i < @a; $i++)
is more idiomatic.)
my @short_reads = <>;
chomp @short_reads;
my @all_end_res;
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
push @all_end_res, $1;
}
}
print $short_reads[0], join('', @all_end_res), "\n";
The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.
my @short_reads = <>;
chomp @short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now the question is why to use $prev_read
when you have $end_of_kmers
inside of $read
.
my @short_reads = <>;
chomp @short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.
my @short_reads = <>;
chomp @short_reads;
my $read = shift @short_reads;
for my $curr_read (@short_reads) {
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.
Edit2:
It is almost ten times faster using substr
and eq
then using regular expressions.
$ time ./code.pl data.out > data.test
real 0m0.480s
user 0m0.468s
sys 0m0.008s
$ time ./code2.pl data.out > data2.test
real 0m4.520s
user 0m4.516s
sys 0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test
Upvotes: 2
Reputation: 6568
with minor modification:
use warnings;
use strict;
open my $in, '<', $ARGV[0] or die $!;
chomp(my @short_reads = <$in>);
my $first_read = $short_reads[0];
my @all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
my ($next_read) = $short_reads[$i+1];
if( (defined $next_read) and ($next_read =~ /^\Q$end_of_kmers/)){
my $end_res = substr($next_read, -1);
push(@all_end_res, $end_res);
}
}
my $end_res2 = join('', @all_end_res);
print $first_read.$end_res2,"\n";
ATGGAAGTCGCGGAATC
Upvotes: 0