Tomb
Tomb

Reputation: 19

get fasta file sequence from a hash

I'm trying to put a FASTA file into a hash so that I can manipulate it later, with the ID as the key and the sequence as the value. But my output is only printing the last ID and joining all the sequences together.

input:

>cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop
UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUA
UCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC

>cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop
CACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUA
UCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA

>cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop
UUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUA
UCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU

>cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop
GUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCA
CCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

my output is this

cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU
cel-mir-38 MI0000009 Caenorhabditis elegans miR-38 stem-loop     UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCUGUGAGCCAGGUCCUGUUCCGGUUUUUUCCGUGGUGAUAACGCAUCCAAAAGUCUCUAUCACCGGGAGAAAAACUGGAGUAGGACCUGUGACUCAU

I'd like to get each ID and the corresponding sequence as an output

code:

use strict;
use warnings;

my %fastahash = ();
my $id        = '';
my $seq       = '';

open FILE, "file.fasta", or die $!;

while ( <FILE> ) {

    chomp;

    if ( $_ =~ /^>(.+)/ ) {
        $id = $1;
    }
    elsif ( $_ =~ m/^[A-Z]+$/ ) {
        $seq .= $_;

    }
    else {
        $fastahash{$id} .= $_;
    }
}

foreach my $sequence ( keys %fastahash ) {
    print "$id $seq\n";
}

close FILE;

Which part should I change?

Also, how can I get the sequence as key and id as value?

Upvotes: 0

Views: 304

Answers (3)

Borodin
Borodin

Reputation: 126722

I realise that this isn't a code review, but I thought it would be useful to make some comments on your code

  • There's ordinarily no need to define variables when you declare them. In fact it will often remove useful error messages if you set scalar variables to the empty string

  • It is best practice to use lexical file handles and the three-parameter form of open. So

    open FILE, "file.fasta", or die $!;
    

    is better written as

    open my $fh, '<', 'file.fasta' or die $!;
    

    (Note that you also have a superfluous comma in your original code.)

    Lexical file handles generally remove the necessity to close them, as they will be closed implicitly when they destroyed when they go out of scope

  • You may be unfamiliar with Perl's default variable $_, but code can be made much clearer and more concise if you use it

    You already use it with chomp, which is equivalent to chomp $_, and $_ =~ /^>(.+)/ need be only /^>(.+)/

  • Note that foreach is exactly equivalent to for, and most programmers familiar with Perl will prefer the former

I would write your program something like this

use strict;
use warnings;

open my $fh, '<', 'file.fasta' or die $!;

my %fasta_hash;
my ($id, $seq);

while ( <$fh> ) {

    chomp;

    if ( /^>(.+)/ ) {
        $id = $1;
    }
    elsif ( /\S/ and not /[^ACGTU]/ ) {
        $seq .= $_;
    }
    else {
        $fasta_hash{$id} = $seq;
    }
}

for my $id ( keys %fasta_hash ) {
    print "$id -- $fasta_hash{$id}\n";
}

output

cel-mir-35 MI0000006 Caenorhabditis elegans miR-35 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCC
cel-mir-37 MI0000008 Caenorhabditis elegans miR-37 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGAUUCUAGAAACCCUUGGACCAGUGUGGGUGUCCGUUGCGGUGCUACAUUCUCUAAUCUGUAUCACCGGGUGAACACUUGCAGUGGUCCUCGUGGUUUCU
cel-mir-36 MI0000007 Caenorhabditis elegans miR-36 stem-loop -- UCUCGGAUCAGAUCGAGCCAUUGCUGGUUUCUUCCACAGUGGUACUUUCCAUUAGAACUAUCACCGGGUGGAAACUAGCAGUGGCUCGAUCUUUUCCCACCGCUGUCGGGGAACCGCGCCAAUUUUCGCUUCAGUGCUAGACCAUCCAAAGUGUCUAUCACCGGGUGAAAAUUCGCAUGGGUCCCCGACGCGGA

As for how to reverse the hash so that the sequences are used as keys, in my version above you can just change the line $fasta_hash{$id} = $seq; to $fasta_hash{$seq} = $id;

Upvotes: 0

aghast
aghast

Reputation: 15300

You're assigning $_ to fastahash when you should, I think, be assigning $seq. Also, you never reset id or seq, so there's a potential bug. Try something like this:

while (<FILE>) {
    chomp;

    if (/^>(.+)/) {
        $id = $1;
    } elsif (/^[A-Z]+$/) {
        $seq .= $_;
    } else {
        $fastahash{$id} = $seq if $id;
        $id = undef;
        $seq = '';
    }
}

$fastahash{$id} = $seq if $id;

Upvotes: 0

choroba
choroba

Reputation: 241758

You aren't correctly accumulating the hash, and you aren't printing it either.

    while (<FILE>) {
        chomp;

        if($_ =~ /^>(.+)/){
            $id = $1;

        } elsif (/^[A-Z]+$/) {
            $seq .= $_;

        } else {
            $fastahash{$id} = $seq;   # Populate the hash.
        }
    }

   for my $id (keys %fastahash) {
      print "$id $fastahash{$id}\n";  # Print it.

   }

Upvotes: 1

Related Questions