Reputation: 67
I'm trying to read a FASTA file into a hash.
Problem
my $primer_seq_14bp = q(); // --> works
my $primer_seq_14bp; // --> error
Use of uninitialized value $primer_seq_14bp in hash element at ./script.pl line 24, <FWDPRIMER> line 1.
Why does this variable need to be declared as an empty string?
This is part of my first "on my own" script, so feedback on formatting, efficiency, etc. is welcome!
Code
#!/usr/bin/perl
use strict; use warnings;
my %primer_name_for;
my $primer_name;
my $primer_seq_14bp;
# Store fwd primer sequences in a sequence{name} hash.
open FWDPRIMER, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
while (<FWDPRIMER>) {
chomp;
if ( /^(>.+)/ ) {
$primer_name = $1;
}
else {
$primer_seq = $_;
$primer_seq_14bp = substr $primer_seq, 0, 14;
}
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
close FWDPRIMER;
Input
Fasta file, which is formatted:
>Sequence_name
DNA_sequence
>Sequence_name
DNA_sequence
etc.
The entire file is in this format, so I thought I could use $_
because if the line isn't a name it has to be a sequence.
Goal
A hash where DNA_sequence
is the key for Sequence_name
. The ultimate action is to search the hash with read sequences to get the biological name of the read (reads come with a default name).
Example
Input primer
>snp_fwd_primer
AAGCTCCTGCAGGTCATCTC
Input read
>read_name
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG
Desired output
Note that the first 20 bp of the read matches the primer sequence. The script cuts the read sequences to 14 bp because that is the length of my shortest primer sequence.
>snp_fwd_primer
AAGCTCCTGCAGGTCATCTCTAGTTGACACCTTTGCTGACAATTATTGTG
Update
I have a solution, however, the answers below are much better. I'm sharing it for other lost biologists (my solution is based on bioinformatics forums).
I changed the record delimiter $/
to read the input primer fasta one entry at a time (I always convert multi-line fasta to single-line; for multi-fasta see @ikegami's answer). Then used regrex to separate the header from the sequence. The major downside is that I have to manually remove the very first >
from the input primer fasta (learned that you can't close and reopen the same file in a perl script).
This did not apply to the input read fasta because I wanted only the sequence line, so I skipped any line that began with >
.
#!/usr/bin/perl
# Usage: thisScript.pl fwdPrimers.fasta reads.fasta > reads_with_primer_names.fasta
# Fasta files need to be in single-line format.
# Before running, remove the first instance of '>' from the fwdPrimerFastaFile!
# Purpose: for reads that have marker forward primer sequences, rename the read with the marker name.
use strict; use warnings;
open PRIMERFILE, '<', $ARGV[0] or die "error reading $ARGV[0]\n";
$/ = '>';
# Store fwd primer sequences in a sequence{name} hash.
my %primer_name_for;
while (<PRIMERFILE>) {
/(.+)\n(.+)\n/ and my $primer_name = $1 and my $primer_seq = $2;
my $primer_seq_14bp = substr $primer_seq, 0, 14;
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
close PRIMERFILE;
# Store reads in a full_length_read{trimmed_read} hash.
# Use trimmed read to search the fwd primer hash.
# Combine primer name and full length read.
open READFILE, '<', $ARGV[1] or die "error reading $ARGV[1]\n";
$/ = "\n";
my %read_fullSeq_for;
my $read_seq;
while (<READFILE>) {
chomp;
if($_ =~ /^>/){
next;
}
else {$read_seq = $_};
my $read_seq_14bp = substr $read_seq, 0, 14;
$read_fullSeq_for{$read_seq_14bp} = $read_seq;
if (exists $primer_name_for{$read_seq_14bp} and $read_fullSeq_for{$read_seq_14bp}) {
print ">$primer_name_for{$read_seq_14bp}\n$read_fullSeq_for{$read_seq_14bp}\n";
}
}
close READFILE;
__END__
Upvotes: 2
Views: 331
Reputation: 385565
Say the first line matches /^(>.+)/
. Then you end up doing
$primer_name_for{undef} = $primer_name;
because you never assigned anything to
$primer_seq_14bp
You should only be adding to the hash when you have a sequence.
my %primer_name_for;
while (<>) {
chomp;
if (/^>(.*)/s) {
$primer_name = $1;
} else {
my $primer_seq = $_;
my $primer_seq_14bp = substr($primer_seq, 0, 14);
$primer_name_for{$primer_seq_14bp} = $primer_name;
}
}
One problem. The sequences can span multiple lines. To fix this, we'll just add to the hash when we encounter a heading or EOF.
my (%primer_name_for, $name, $seq);
while (<>) {
chomp;
if (/^>(.*)/s) {
$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
$name = $_;
$seq = '';
} else {
$seq .= $_;
}
}
$primer_name_for{substr($seq, 0, 14)} = $name if defined($name);
Upvotes: 4
Reputation: 6798
Please investigate following sample code for compliance with your problem.
My understanding that OP tries to build reverse lookup table for sequence name as hash value, based on partial DNA sequence as a hash key.
NOTE: it would be helpful for clarification of the problem to have some sample input data and desired output
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $length = 14;
my %sequence = do { local $/; <DATA> =~ />(.*?)\n([^>]*)/gs};
$sequence{$_} =~ s/\n//g for keys %sequence;
my %lookup_table = map { substr($sequence{$_},0,$length) => $_ } keys %sequence;
say Dumper(\%sequence);
say Dumper(\%lookup_table);
exit 0;
__DATA__
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Output
$VAR1 = {
'SEQUENCE_2' => 'SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQIATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH',
'SEQUENCE_1' => 'MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL'
};
$VAR1 = {
'MTEITAAMVKELRE' => 'SEQUENCE_1',
'SATVSEINSETDFV' => 'SEQUENCE_2'
};
Upvotes: 1