Reputation: 31
I have a big fasta file input.fasta which consists many duplicate sequences. I want to enter a header name and extract out all the sequences with the matching header. I know this could be done easily done with awk/sed/grep but I need a Perl code.
input.fasta
>OGH38127_some_organism
PAAALGFSHLARQEDSALTPKHYTWTAPGEGDVRAPCPVLNTLANHEFLPHNGKNITVDK
AITALGDAMNISPALATTFFTGGLKTNPTPNATWFDLDMLHKHNVLEHDGSLSRRDMHFD
TSNKFDAATFANFLSYFDANATVLGVNETADARARHAYDMSKMNPEFTITSSMLPIMVGE
SVMMMLVWGSVEEPGAQRDYFEYFFRNERLPVELGWTPGETEIGVPVVTAMITAMVAASP
TDVP
>ABC14110_some_different_org_name
WWVAPGPGDSRGPCPGLNTLANHGYLPHDGKGITLSILADAMLDGFNIARSDALLLFTQ
AIRTSPQYPATNSFNLHDLGRDQLNRHNVLEHDASLSRADDFFGSNHIFNETVFDESRAY
AMLANSKIARQINSKAFNPQYKFTSKTEQFSLGEIAAPIIAFGNSTSGEVNRTLVEYFFM
NERLPIELGWKKSEDGIALDDILRVTQMISKAASLITPSALSWTAETLTP
>OGH38127_some_organism
LPWSRPGPGAVRAPCPMLNTLANHGFLPHDGKNISEARTVQALGRALNIEKELSQFLFEK
ALTTNPHTNATTFSLNDLSRHNLLEHDASLSRQDAYFGDNHDFNQTIFDETRSYWPHPVI
DIQAAALSRQARVNTSIAKNPTYNMSELGLDFSYGETAAYILILGDKDFGKVNRSWVEYL
FENERLPVELGWTRHNETITSDDLNTMLEKVVN
.
.
.
I have tried with the following script but it is not giving any output.
script.pl
#!/perl/bin/perl -w
use strict;
use warnings;
print "Enter a fasta header to search for:\n";
my $head = <>;
my $file = "input.fasta";
open (READ, "$file") || die "Cannot open $file: $!.\n";
my %seqs;
my $header;
while (my $line = <READ>){
chomp $line;
$line =~ s/^>(.*)\n//;
if ($line =~ m/$head/){
$header = $1;
}
}
close (READ);
open( my $out , ">", "out.fasta" ) or die $!;
my @count_seq = keys %seqs;
foreach (@count_seq){
print $out $header, "\n";
print $out $seqs{$header}, "\n";
}
exit;
Please help me correct this script. Thanks!
Upvotes: 0
Views: 349
Reputation: 52344
If you use the Bioperl module Bio::SeqIO to handle the parsing of the fasta files, it becomes really simple:
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
my ($file, $name) = @ARGV;
my $in = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => "fasta");
while (my $s = $in->next_seq) {
$out->write_seq($s) if $s->display_id eq $name;
}
run with perl grep_fasta.pl input.fasta OGH38127_some_organism
Upvotes: 4
Reputation: 241838
There's no need to store the sequences in memory, you can print them directly when reading the file. Use a flag variable ($inside
in the example) that tells you whether you're reading the desired sequence or not.
#! /usr/bin/perl
use warnings;
use strict;
my ($file, $header) = @ARGV;
my $inside;
open my $in, '<', $file or die $!;
while (<$in>) {
$inside = $1 eq $header if /^>(.*)/;
print if $inside;
}
Run as
perl script.pl file.fasta OGH38127_some_organism > output.fasta
Upvotes: 2