EA00

Reputation: 633

Perl - initialization of hash

I'm not sure how to correctly initialize my hash - I'm trying to create a key/value pair for values in coupled lines in my input file.

For example, my input looks like this:

@cluster t.18
46421 ../../../output###.txt/
@cluster t.34
41554 ../../../output###.txt/

I'm extracting the t number from line 1 (@cluster line) and matching it to output###.txt in the second line (line starting with 46421). However, I can't seem to get these values into my hash with the script that I have written.

#!/usr/bin/perl
use warnings;
use strict;

my $key;
my $value;
my %hash;

my $filename = 'input.txt';
open my $fh, '<', $filename or die "Can't open $filename: $!";

while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ m/^\@cluster/) {
        my @fields = split /(\d+)/, $line;
        my $key = $fields[1];
    }
    elsif ($line =~ m/^(\d+)/) {
        my @output = split /\//, $line;
        my $value = $output[5];
    }
    $hash{$key} = $value;
}

Upvotes: 2

Views: 483

Answers (1)

zdim

Reputation: 66954

The idea is fine, but the my $key inside the if block declares a new lexical variable scoped to that block, which masks the $key declared at the top of the file. Inside the if block the name $key has nothing to do with the one you nicely declared up front. See my in perlsub.

This inner $key goes out of scope as soon as the if block ends and does not exist outside it. The outer $key is visible again after the if block, and throughout the rest of the loop, but it is undefined since it has never been assigned to. The same goes for $value in the elsif block.

Just drop the my declarations inside the loop, so that you assign to the variables declared up front (as intended?). That is, write $key = ... and $value = ..., and the hash will be populated correctly.
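
To see the masking in isolation, here is a minimal sketch. The variable names and the values 'inner' and 'outer' are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

my $key;                      # outer variable, declared up front

if (1) {
    my $key = 'inner';        # a NEW variable, visible only inside this block
}
# The outer $key was never touched, so it is still undefined here.
my $after_shadow = defined $key ? $key : 'undef';

if (1) {
    $key = 'outer';           # no "my": assigns to the variable declared above
}
my $after_assign = $key;

print "after shadowing: $after_shadow\n";   # after shadowing: undef
print "after assigning: $after_assign\n";   # after assigning: outer
```

Note that use warnings stays silent here: masking a variable from an enclosing scope is legal, so nothing flags the mistake.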


Note -- this is only about how to get that hash assignment right. I don't know what your actual data looks like or whether the lines are parsed correctly. Here is a toy input.txt:

@cluster t.1 
1111 ../../../output1.1.txt/
@cluster t.2 
2222 ../../../output2.2.txt/

I pick the 4th field instead of the 6th, $value = $output[3];, and add

print "$_ => $hash{$_}\n" for keys %hash;

after the loop. This prints (in no guaranteed order, since hash keys are unordered)

1 => output1.1.txt
2 => output2.2.txt

I am not sure whether this is what you want but the hash is built fine.
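
For reference, the whole test can be sketched as one self-contained script. Two things here are my own additions for the sake of a runnable example: the toy data is read through an in-memory filehandle rather than from input.txt, and the keys are sorted so the output order is stable.

```perl
use strict;
use warnings;

# Toy data from above, read through an in-memory filehandle
# (a stand-in for opening input.txt).
my $data = <<'END';
@cluster t.1
1111 ../../../output1.1.txt/
@cluster t.2
2222 ../../../output2.2.txt/
END

open my $fh, '<', \$data or die "Can't open in-memory data: $!";

my ($key, $value, %hash);

while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ m/^\@cluster/) {
        my @fields = split /(\d+)/, $line;
        $key = $fields[1];        # no "my": assigns to the outer $key
    }
    elsif ($line =~ m/^\d+/) {
        my @output = split /\//, $line;
        $value = $output[3];      # 4th field for this toy input
    }
    $hash{$key} = $value;
}
close $fh;

# sort is added only to make the output order deterministic
print "$_ => $hash{$_}\n" for sort keys %hash;
```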


A comment on choice of tools in parsing

You parse the lines for numbers by relying on the fact that split returns the separators as well when the pattern captures them. That is neat, but in a sense it reverses the main purpose of split, which is to extract the other components of the string, as delimited by the pattern. It can thus make the purpose of the code a little convoluted, and you also have to index very precisely to retrieve what you need.
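
A small sketch of that split behavior, using the first line of the question's sample input (the variable names are mine):

```perl
use strict;
use warnings;

my $line = '@cluster t.18';

# Without capturing parentheses the separator (the number) is discarded.
my @plain    = split /\d+/, $line;      # ('@cluster t.')

# With capturing parentheses the separator is returned as a field too.
my @captured = split /(\d+)/, $line;    # ('@cluster t.', '18')

print scalar(@plain), " field(s) without capture\n";
print scalar(@captured), " field(s) with capture, number: $captured[1]\n";
```

The number ends up at index 1 only because nothing that matches \d+ precedes it in the line, which is the indexing fragility mentioned above.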

Instead of using split to extract the delimiter itself, which is given by a regex, why not extract it by a regex? That makes the intention crystal clear, too. For example, with input

@cluster t.10 has 4319 elements, 0 subclusters 
37652 ../../../../clust/output43888.txt 1.397428

the parsing can go as

if ($line =~ m/^\@cluster/) {
    ($key) = $line =~ /t\.(\d+)/;
}   
elsif ($line =~ m/^(\d+)/) { 
    ($value) = $line =~ m|.*/(\w+\.txt)|;
}    
$hash{$key} = $value if defined $key and defined $value;

where t\. and \.txt are added to more precisely specify the targets. If the target strings aren't certain to have that precise form, just capture \d+, and in the second case all non-space after the last /, say by m|^\d+.*/(\S+)|. We use the greediness of .*, which matches everything possible up to the thing that comes after it (a /), thus all the way to the very last /.
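
The effect of greediness can be checked on the sample data line. This is only a sketch: the third pattern, with a non-greedy .*?, is not part of the parsing above and is included just for contrast.

```perl
use strict;
use warnings;

my $line = '37652 ../../../../clust/output43888.txt 1.397428';

# Greedy .* runs as far as it can while still leaving a "/" followed by
# something the capture group can match, so the capture starts after
# the very last "/".
my ($strict) = $line =~ m|.*/(\w+\.txt)|;
my ($loose)  = $line =~ m|^\d+.*/(\S+)|;

# For contrast: non-greedy .*? stops at the first "/" that works.
my ($first)  = $line =~ m|^\d+.*?/(\S+)|;

print "strict: $strict\n";   # strict: output43888.txt
print "loose:  $loose\n";    # loose:  output43888.txt
print "first:  $first\n";    # first:  ../../../clust/output43888.txt
```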

Then you can also reduce it to a single regex for each line, for example

if ($line =~ m/^\@cluster\s+t\.(\d+)/) {
    $key = $1;
}
elsif ($line =~ m|^\d+.*/(\w+\.txt)|) {
    $value = $1;
}

Note that I've added a condition to the hash assignment (the last line of the first snippet; the same line follows the second one). The original code in fact assigns an undef on the first iteration, since no $value has been seen at that point. This is overwritten on the next iteration, so we don't see it if we only print the hash afterwards. The condition also guards against failed matches, for malformed lines or such. Of course, far better checks can be run.
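
As one possible direction for such checks, here is a sketch that pairs each @cluster line only with the data line that follows it and warns about anything it cannot use. The helper name parse_clusters, the warning messages, and the last two lines of the test data are hypothetical, invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): pairs each @cluster line
# with the data line that follows it and reports lines it cannot use.
sub parse_clusters {
    my ($fh) = @_;
    my (%hash, $key);

    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ m/^\@cluster\s+t\.(\d+)/) {
            warn "line $.: no data line for cluster $key\n" if defined $key;
            $key = $1;
        }
        elsif ($line =~ m|^\d+.*/(\w+\.txt)|) {
            if (defined $key) {
                $hash{$key} = $1;
                undef $key;      # this key is used up
            }
            else {
                warn "line $.: data line without a preceding \@cluster line\n";
            }
        }
        else {
            warn "line $.: unrecognized line skipped\n";
        }
    }
    warn "end of input: no data line for cluster $key\n" if defined $key;
    return \%hash;
}

# The first two lines are the sample from above; the last two are made up
# to trigger the warnings.
my $data = <<'END';
@cluster t.10 has 4319 elements, 0 subclusters
37652 ../../../../clust/output43888.txt 1.397428
this line is malformed
@cluster t.11 has 12 elements, 0 subclusters
END

open my $fh, '<', \$data or die "Can't open in-memory data: $!";
my $clusters = parse_clusters($fh);
close $fh;

print "$_ => $clusters->{$_}\n" for sort keys %$clusters;
```

Resetting $key after each stored pair means a stale key or value is never carried over to the next pair, so the hash never holds a temporary wrong entry. The warnings go to STDERR, and only the complete pair reaches the hash.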

Upvotes: 6
