jnmf
jnmf

Reputation: 137

Reverse complement of fasta file

I'm trying to get the reverse complement of RNA in a multi fasta file

input:

>cel-mir-39 MI0010 C elegans miR-39
UAUACCGAGAGCCCAGCUGAUUUCGUCUUGGUAAUAAGCUCGUCAUUGAGAUUAUCACCGGGUGUAAAUCAGCUUGGCUCAAAAAAAA

>cel-let-7 MI0001 C elegans let-7
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGGGGGGG

output:

>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA

>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA

But I'm getting this instead:

UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
93-Rim snucele G 0100IM 93-rim-leg 
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
7-tel snucele G 1000IM 7-tel-leg 

my code:

#!/usr/bin/perl
use strict;
use warnings;

print "type in the path of the file\n";
my $file_name = <>;
chomp($file_name); 

open (FASTA, $file_name) or die "error #!"; 

$/ = ">";
<FASTA>;    
while (my $entry = <FASTA>){
    $entry = reverse $entry;
    $entry =~ tr/ACGUacgu/UGCAugca/;
    print "$entry \n";
}

close(FASTA);

How can I reverse only the sequence and not the header? Thanks

Upvotes: 2

Views: 1061

Answers (3)

zdim
zdim

Reputation: 66873

Reading records separated by > is a nice idea as it gives you the whole chunk at a time. However, here you want to process and merge lines but not the header, thus distinguishing between lines. It is clearer to read line by line.

The sequence-line is specific: all caps and nothing else. The blank line separates records to process. The remaining possibility is the header. The sequence is assembled by joining lines that match its pattern, and once we hit the blank line it is processed and printed.

open (FASTA, $file_name) or die "error $!";

# sequence, built by joining lines =~ /^[A-Z]+$/
my $sequence = '';

while (my $entry = <FASTA>)
{
    if ($entry =~ m/^[A-Z]+$/) {
        # Assemble the sequence from separate lines
        chomp($entry);
        $sequence .= $entry;
    }
    elsif ($entry =~ m/^\s*$/) { 
        # process and print the sequence and blank line, reset for next
        $sequence = reverse $sequence;
        $sequence =~ tr/ACGUacgu/UGCAugca/;
        print "$sequence\n";
        print "\n";
        $sequence = '';
    }
    else { # header
        print $entry;
    }
}

# Print the last sequence if the file didn't end with blank line    
if (length $sequence) {
    $sequence = reverse $sequence;
    $sequence =~ tr/ACGUacgu/UGCAugca/;
    print "$sequence\n";
}

The ^ and $ are anchors, for the beginning and end of string. So the regex matching the sequence requires that the whole line be strictly caps. The other regex allows only optional space \s*, specifying a blank line.

The sequence processing is copied from the question.

Upvotes: 2

mkHun
mkHun

Reputation: 5927

Try something as follow

First I split the data by the newline character. And store the header into the $header and the rest of the data in @ar.

Then join the array by the newline and store into the $entry. Then perform the substitution for to remove the \n>\r\s characters from the RNA sequence.

Then as usual reverse the string and perform the translation. Finally get the output by the print statement.

open my $fh,"<","filename.text" or die"error opening $!";

$/ = ">";

<$fh>;

while (<$fh>)
{
    my ($header,@ar) = split("\n",$_);

    my $entry =join("\n",@ar);

    $entry=~s/\n|\r|>|\s//g;

    $entry = reverse $entry;

    $entry =~ tr/ACGUacgu/UGCAugca/;

    print ">$header\n$entry\n\n";
}

Upvotes: 0

Kaz
Kaz

Reputation: 58500

TXR solution:

@(bind compl @(hash-from-pairs (zip "ACGUacgu" "UGCAugca")))
@(repeat)
>@header
@  (collect)
@rna
@  (until)

@  (end)
@  (output)
>@header
@(mapcar compl (reverse (cat-str rna)))

@  (end)
@(end)

Run:

$ txr revcomp.txr data
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA

>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA

This variant formats the output to 46 columns, like the original:

@(bind compl @(hash-from-pairs (zip "ACGUacgu" "UGCAugca")))
@(repeat)
>@header
@  (collect)
@rna
@  (until)

@  (end)
@  (output)
>@header
@  (repeat :vars ((crna (tuples 46 (mapcar compl (reverse (cat-str rna)))))))
@crna
@  (end)

@  (end)
@(end)

Run:

$ txr revcomp.txr data
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGA
CGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA

>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAA
UAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA

Upvotes: 0

Related Questions