xristos bramis
xristos bramis

Reputation: 1

how can I replace lines from 2 different files using regex in Perl?

I have 1st file ( Delta_spike_sorted.fasta)with a format of:

>lcl|KJ584357.1  
AAAAA 
>lcl|JQ065046.1  
GGGGG

and 2nd file (Delta_final.fasta) with the format of:

>KJ584357.1 Porcine coronavirus HKU15 strain KY4813, complete genome
TTTTTT 
>JQ065046.1 Magpie-robin coronavirus HKU18 strain HKU18-chu3, complete genome  
CCCCCC

I'm trying to write a script to replace the >lcl... of the 1st file with the equivalent title of the 2nd file by matching their IDs (those next to lcl). The final outcome should be something like this:

>Porcine coronavirus HKU15 strain KY4813
AAAAA  
>Magpie-robin coronavirus HKU18 strain HKU18-chu3  
GGGGG

Now that I see it again, maybe using some hashes would be the most suitable option (sorry for the many mistakes, it's my first post here, also i'm a noob on programming)

#!/usr/bin/perl -w
open (FIN, "< coronavirus_complete/complete_final/Delta_final.fasta") or die "unable to open FIN \n";
open (FH, "< coronavirus_cds/Spikes/Spikes_complete/sorted/Delta_spike_sorted.fasta") or die "unable 
to open FH \n";

while ($line=<FH>){
    if ($line =~ /^>/){
        chomp($line);
        $acc=substr($line,5,9);
        #print "$acc\n";
    }
    while ($string=<FIN>){
        if ($string =~ /^>/){
        chomp ($string);
        $gen=substr($string,12);
        #print "$gen\n";
    }       
       if ($acc =~ /\Q$string/){
           $line =~ s/$line/$gen/g;
        print "$acc\n";
       }
}

Upvotes: 0

Views: 94

Answers (1)

H&#229;kon H&#230;gland
H&#229;kon H&#230;gland

Reputation: 40748

Here is an example where I read the definitions from the second file into a hash first. Then you avoid to reread that file for each line in the first file:

use feature qw(say);
use strict;
use warnings;

{  # <-- scope to prevent local lexical variable to 'leak' into subs below
    my $map = read_fin("Delta_final.fasta");
    my $fn = 'Delta_spike_sorted.fasta';
    open (my $fh, '<', $fn) or die "unable to open file '$fn': $! \n";

    while (my $line=<$fh>) {
        chomp $line;
        if ($line =~ /^>/){
            my $acc = substr $line,5,10;
            if (exists $map->{$acc}) {
                say ">$map->{$acc}";
                next;
            }
        }
        say  $line;
    }
    close $fh;
}

sub read_fin {
    my ($fn) = @_;

    my %map;
    open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
    while( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^>((?:\S){10})\s+(\S.*)$/ ) {
            $map{$1} = $2;
        }
    }
    close $fh;
    return \%map;
}

Output:

>Porcine coronavirus HKU15 strain KY4813, complete genome
AAAAA
>Magpie-robin coronavirus HKU18 strain HKU18-chu3, complete genome
GGGGG

Upvotes: 1

Related Questions