Reputation: 87
After a previous question i have some code which nearly does what i intend to do, but not quite.
I am trying to compare every position from FILE1 (3750/126477 etc) with every range in FILE2 (517-1878,2156-3289 etc). If it falls in a range, i want to print the position, range, direction and then the FASTA sequence from the next line onto the same line in the output file. Currently if multiple positions lie in the same range then it will group all the positions into a block before including the sequence only on the last line, when i want every match to include the relevant gene sequence.
My FILE1 example data:
7065_8#10 3750 - t
7065_8#10 126477 - c
7065_8#10 1200 + T
7065_8#10 3800 - t
My FILE2 example data:
>SAEMRSA15_00010 dnaA_chromosomal_replication_initiator_protein_DnaA 517 1878 forward
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACTCAACTTTCCTAAA
>SAEMRSA15_00020 dnaN_DNA_polymerase_III,_beta_chain 2156 3289 forward
ATGATGGAATTCACTATTAAAAGAGATTATTTTATTACACAATTAAATGACACATTAAAAGCTATTTCACCAAGAACAACA
>SAEMRSA15_00030 conserved_hypothetical_protein 3670 3915 forward
GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGAATCTGGTGGTCAAG
My code:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";
open FILE1, "/Users/edwardtickle/Documents/CC22indels.tab";
open FILE2, "/Users/edwardtickle/Documents/CC22_CDS_rmmge.aln";
open (OUTPUTFILE, ">$outputfile");
my @file1list=();
while (<FILE1>) {
if (/^\S+\s+(\d+)/) {
push @file1list, $1;
}
}
my $nextline = 0;
close FILE1;
while ( my $line = <FILE2> ) {
if ($nextline) {
print OUTPUTFILE "$line\n";
$nextline = '';
}
elsif ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
my $cds1 = $1;
my $cds2 = $2;
my $cds3 = $3;
my $cds4 = $4;
for my $cc22 (@file1list) {
if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
$nextline++;
print OUTPUTFILE "$cc22 $cds2 $cds3 $cds4\n";
}
}
}
}
close FILE2;
My results:
1200 517 1878 forward
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACTCAACTTTCCTAAA
3750 3670 3915 forward
3800 3670 3915 forward
GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGAATCTGGTGGTCAAG
My desired results:
1200 517 1878 forward
ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTGTAAGTTACT
3750 3670 3915 forward GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGA
3800 3670 3915 forward
GTGATTATTTTGGTTCAAGAAGTTGTAGTAGAAGGAGACATTAATTTAGGTCAATTTCTAAAAACAGAAGGGATTATTGAATCTGGTGGTCAAG
I believe it is due to the fact that the first part of the code is never matched until the second if rule, but i don't know how to change the order while keeping it functional.
Alternatively, is there a way of printing the next line after a header match if it includes the letters ATCG (which it obviously always will). This would strike me as a more efficient way but yet again i don't know where to start.
Thanks for your help!
Upvotes: 0
Views: 492
Reputation: 9520
To get the results you want without changing your existing code too much, you can get the sequence line when you're processing the header line:
while ( my $line = <FILE2> ) {
if ($line =~ /^>(\S+)\s+\S+\s+(\d+)\s+(\d+)\s+(\S+)/) {
my $cds1 = $1;
my $cds2 = $2;
my $cds3 = $3;
my $cds4 = $4;
# fetch the next line from the file -- i.e. the sequence
$nextline = <FILE2>;
for my $cc22 (@file1list) {
if ( $cc22 > $cds2 && $cc22 < $cds3 ) {
print "$cc22 $cds2 $cds3 $cds4 $nextline";
}
}
}
}
Upvotes: 1
Reputation: 241848
You can use an inner loop to print all the matches in the same range.
#!/usr/bin/perl
use warnings;
use strict;
open my $IND, '<', 'file1' or die $!;
my @pos;
while (<$IND>) {
push @pos, (split)[1];
}
@pos = sort { $a <=> $b } @pos;
open my $FST, '<', 'file2' or die $!;
while (<$FST>) {
next unless /^>/;
my ($from, $to, $direction) = (split)[2 .. 4];
shift @pos while $pos[0] < $from;
next if $pos[0] > $to;
my $nextline = <$FST>;
while ($pos[0] <= $to) {
print "$pos[0] $from $to $direction\n";
print $nextline;
shift @pos;
}
}
Upvotes: 0