Reputation: 75237
I want to implement this algorithm in Perl. Let's accept that:
First element of DNA1 is G, we will find if there is G at DNA2 and point it with a dot. We continue it till the end so the image shows evey same element intersections as a dot.
The next step is that: connecting points. To point to dots first one should be at left top corner of a small square and the other one at right bottom(I mean lines should have a 135 degree) If stringency is 2, it means that reject the lines which has occured from 2 and less then 2 dots(It means that if stringency was 3, there would be just one line at image).
Last step is that: wordcount. If wordcount is 1(it is one at image) it means that compare elements one by one. If it was 3 it means that compare 3 of them together. You can write a program that wordcount is 1 because it is always 1.
I searched about it and this is what I have:
$infile1 = "DNA1.txt";
$infile2 = "DNA2.txt";
$outfile = "plot.txt";
$wordsize=0;
$stringency=0;
open inf, $infile1 or die "STOP! File $infile1 not found.\n";
$sequence1=<inf>;
chomp $sequence1;
@seq1=split //,$sequence1;
close inf;
open inf, $infile2 or die "STOP! File $infile2 not found.\n";
$sequence2=<inf>;
chomp $sequence2;
@seq2=split //,$sequence2;
close inf;
$Lseq1=$#seq1+1;
$Lseq2=$#seq2+1;
open ouf, ">$outfile";
for ($i=0;$i<$Lseq1;$i++){
print ouf "\n";
for ($j=0;$j<$Lseq2;$j++){
$match=0;
for ($w=0;$w<=$wordsize;$w++){
if($seq1[$i+$w] eq $seq2[$j+$w]){
$match++;
}
}
if($match > $stringency){
print ouf "1";
}
else{
print ouf "0";
}
}
}
Can you check it about errors and how can I optimize my code with more less code at Perl?
PS: I think it is OK to accept $wordsize equals to $stringency every time.
EDIT 1: I have edited my code and it works for just puting dot.
EDIT 2: Algorithm is like that:
qseq, sseq = sequences
win = number of elements to compare for each point
Strig = number of matches required for a point
for each q in qseq:
for each s in sseq:
if CompareWindow(qseq[q:q+win], s[s:s+win], strig):
AddDot(q, s)
EDIT 3: Here is a better algorithm suggestion:
osl.iu.edu/~chemuell/projects/bioinf/dotplot.ppt
Any idea to improve code according to that better algorithm is welcome.
Upvotes: 0
Views: 326
Reputation: 386461
First, the innermost for loop is totally unnecessary. Getting rid of it will speed up your code.
Secondly, your code doesn't work for $stringency other than 0.
Fix:
use strict;
use warnings;
my $s1 = 'GACTAGGC';
my $s2 = 'AGCTAGGA';
my $stringency = 0;
my @s1 = split //, $s1;
my @s2 = split //, $s2;
my @L;
for my $i (0..$#s1) {
for my $j (0..$#s2) {
if ($s1[$i] ne $s2[$j]) {
$L[$i][$j] = 0;
} elsif ($i == 0 || $j == 0) {
$L[$i][$j] = 1;
} else {
$L[$i][$j] = $L[$i-1][$j-1] + 1;
}
print $L[$i][$j] <= $stringency ? "0" : "1";
}
print("\n");
}
Now that we have an efficient algorithm, we can tweak the implementation.
use strict;
use warnings;
my $s1 = 'GACTAGGC';
my $s2 = 'AGCTAGGA';
my $stringency = 0;
my @s1 = split //, $s1;
my @s2 = split //, $s2;
my @L = (0) x @s2;
for my $i (0..$#s1) {
for my $j (0..$#s2) {
if ($s1[$i] eq $s2[$j]) {
++$L[$j];
} else {
$L[$j] = 0;
}
print $L[$j] <= $stringency ? "0" : "1";
}
print("\n");
pop @L;
unshift @L, 0;
}
If you want a better idea of what's going on, replace
print $L[$j] <= $stringency ? "0" : "1";
with
print $L[$j];
You'll get something like
01000110
10001002
00100000
00020000
10003001
02000410
01000150
00200000
By the way, what are trying to achieve is remarkably similar to finding the longest common substring.
Update To get $s1
and $s2
from files, one line at a time,
open(my $fh1, '<', ...) or die(...);
open(my $fh2, '<', ...) or die(...);
for (;;) {
my $s1 = <$fh1>;
my $s2 = <$fh2>;
die("Files have different length\n")
if defined($s1) && !defined($s2)
|| !defined($s1) && defined($s2);
last if !defined(($s1);
chomp($s1);
chomp($s2);
... code to generate graph ...
}
Upvotes: 4