Reputation: 1
I have 1st file ( Delta_spike_sorted.fasta)with a format of:
>lcl|KJ584357.1
AAAAA
>lcl|JQ065046.1
GGGGG
and 2nd file (Delta_final.fasta) with the format of:
>KJ584357.1 Porcine coronavirus HKU15 strain KY4813, complete genome
TTTTTT
>JQ065046.1 Magpie-robin coronavirus HKU18 strain HKU18-chu3, complete genome
CCCCCC
I'm trying to write a script to replace the >lcl... of the 1st file with the equivalent title of the 2nd file by matching their IDs (those next to lcl). The final outcome should be something like this:
>Porcine coronavirus HKU15 strain KY4813
AAAAA
>Magpie-robin coronavirus HKU18 strain HKU18-chu3
GGGGG
Now that I see it again, maybe using some hashes would be the most suitable option (sorry for the many mistakes, it's my first post here, also i'm a noob on programming)
#!/usr/bin/perl -w
open (FIN, "< coronavirus_complete/complete_final/Delta_final.fasta") or die "unable to open FIN \n";
open (FH, "< coronavirus_cds/Spikes/Spikes_complete/sorted/Delta_spike_sorted.fasta") or die "unable
to open FH \n";
while ($line=<FH>){
if ($line =~ /^>/){
chomp($line);
$acc=substr($line,5,9);
#print "$acc\n";
}
while ($string=<FIN>){
if ($string =~ /^>/){
chomp ($string);
$gen=substr($string,12);
#print "$gen\n";
}
if ($acc =~ /\Q$string/){
$line =~ s/$line/$gen/g;
print "$acc\n";
}
}
Upvotes: 0
Views: 94
Reputation: 40748
Here is an example where I read the definitions from the second file into a hash first. Then you avoid to reread that file for each line in the first file:
use feature qw(say);
use strict;
use warnings;
{ # <-- scope to prevent local lexical variable to 'leak' into subs below
my $map = read_fin("Delta_final.fasta");
my $fn = 'Delta_spike_sorted.fasta';
open (my $fh, '<', $fn) or die "unable to open file '$fn': $! \n";
while (my $line=<$fh>) {
chomp $line;
if ($line =~ /^>/){
my $acc = substr $line,5,10;
if (exists $map->{$acc}) {
say ">$map->{$acc}";
next;
}
}
say $line;
}
close $fh;
}
sub read_fin {
my ($fn) = @_;
my %map;
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
while( my $line = <$fh> ) {
chomp $line;
if ( $line =~ /^>((?:\S){10})\s+(\S.*)$/ ) {
$map{$1} = $2;
}
}
close $fh;
return \%map;
}
Output:
>Porcine coronavirus HKU15 strain KY4813, complete genome
AAAAA
>Magpie-robin coronavirus HKU18 strain HKU18-chu3, complete genome
GGGGG
Upvotes: 1