Reputation: 137
I'm trying to get the reverse complement of RNA in a multi fasta file
input:
>cel-mir-39 MI0010 C elegans miR-39
UAUACCGAGAGCCCAGCUGAUUUCGUCUUGGUAAUAAGCUCGUCAUUGAGAUUAUCACCGGGUGUAAAUCAGCUUGGCUCAAAAAAAA
>cel-let-7 MI0001 C elegans let-7
UACACUGUGGAUCCGGUGAGGUAGUAGGUUGUAUAGUUUGGAAUAUUACCACCGGUGAACUAUGCAAUUUUCUACCUUACCGGAGGGGGGG
output:
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
But I'm getting this instead:
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
93-Rim snucele G 0100IM 93-rim-leg
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
7-tel snucele G 1000IM 7-tel-leg
my code:
#!/usr/bin/perl
use strict;
use warnings;
print "type in the path of the file\n";
my $file_name = <>;
chomp($file_name);
open (FASTA, $file_name) or die "error #!";
$/ = ">";
<FASTA>;
while (my $entry = <FASTA>){
$entry = reverse $entry;
$entry =~ tr/ACGUacgu/UGCAugca/;
print "$entry \n";
}
close(FASTA);
How can I reverse only the sequence and not the header? Thanks
Upvotes: 2
Views: 1061
Reputation: 66873
Reading records separated by >
is a nice idea as it gives you the whole chunk at a time. However, here you want to process and merge lines but not the header, thus distinguishing between lines. It is clearer to read line by line.
The sequence-line is specific: all caps and nothing else. The blank line separates records to process. The remaining possibility is the header. The sequence is assembled by joining lines that match its pattern, and once we hit the blank line it is processed and printed.
open (FASTA, $file_name) or die "error $!";
# sequence, built by joining lines =~ /^[A-Z]+$/
my $sequence = '';
while (my $entry = <FASTA>)
{
if ($entry =~ m/^[A-Z]+$/) {
# Assemble the sequence from separate lines
chomp($entry);
$sequence .= $entry;
}
elsif ($entry =~ m/^\s*$/) {
# process and print the sequence and blank line, reset for next
$sequence = reverse $sequence;
$sequence =~ tr/ACGUacgu/UGCAugca/;
print "$sequence\n";
print "\n";
$sequence = '';
}
else { # header
print $entry;
}
}
# Print the last sequence if the file didn't end with blank line
if (length $sequence) {
$sequence = reverse $sequence;
$sequence =~ tr/ACGUacgu/UGCAugca/;
print "$sequence\n";
}
The ^
and $
are anchors, for the beginning and end of string. So the regex matching the sequence requires that the whole line be strictly caps. The other regex allows only optional space \s*
, specifying a blank line.
The sequence processing is copied from the question.
Upvotes: 2
Reputation: 5927
Try something as follow
First I split the data by the newline character. And store the header into the $header
and the rest of the data in @ar
.
Then join the array by the newline and store into the $entry
. Then perform the substitution for to remove the \n>\r\s
characters from the RNA sequence.
Then as usual reverse the string and perform the translation. Finally get the output by the print statement.
open my $fh,"<","filename.text" or die"error opening $!";
$/ = ">";
<$fh>;
while (<$fh>)
{
my ($header,@ar) = split("\n",$_);
my $entry =join("\n",@ar);
$entry=~s/\n|\r|>|\s//g;
$entry = reverse $entry;
$entry =~ tr/ACGUacgu/UGCAugca/;
print ">$header\n$entry\n\n";
}
Upvotes: 0
Reputation: 58500
TXR solution:
@(bind compl @(hash-from-pairs (zip "ACGUacgu" "UGCAugca")))
@(repeat)
>@header
@ (collect)
@rna
@ (until)
@ (end)
@ (output)
>@header
@(mapcar compl (reverse (cat-str rna)))
@ (end)
@(end)
Run:
$ txr revcomp.txr data
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGACGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAAUAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
This variant formats the output to 46 columns, like the original:
@(bind compl @(hash-from-pairs (zip "ACGUacgu" "UGCAugca")))
@(repeat)
>@header
@ (collect)
@rna
@ (until)
@ (end)
@ (output)
>@header
@ (repeat :vars ((crna (tuples 46 (mapcar compl (reverse (cat-str rna)))))))
@crna
@ (end)
@ (end)
@(end)
Run:
$ txr revcomp.txr data
>cel-mir-39 MI0010 C elegans miR-39
UUUUUUUUGAGCCAAGCUGAUUUACACCCGGUGAUAAUCUCAAUGA
CGAGCUUAUUACCAAGACGAAAUCAGCUGGGCUCUCGGUAUA
>cel-let-7 MI0001 C elegans let-7
CCCCCCCUCCGGUAAGGUAGAAAAUUGCAUAGUUCACCGGUGGUAA
UAUUCCAAACUAUACAACCUACUACCUCACCGGAUCCACAGUGUA
Upvotes: 0