Reputation: 2498
I'm working on a script that generates several large hash-of-arrays (HoA) data structures. I'm trying to optimize the script, as it currently takes a considerable amount of time to run.
I've done a bit of benchmarking, and I've managed to make the script roughly 3.5 times faster by using array references and by reducing subroutine call overhead, reading @_
directly instead of copying it into a variable. I've also removed unnecessary subroutines and redundant variable declarations. Despite these improvements, I'd like to make the code run even faster.
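By "using @_ directly" I mean roughly this (a toy illustration, not my real subroutines):
# Copies the argument into a lexical before using it:
sub count_slow { my ($aref) = @_; return scalar @$aref; }
# Reads @_ in place, skipping the copy:
sub count_fast { return scalar @{ $_[0] }; }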
At the start of my script I parse a large file to generate two HoA data structures. Which of the following approaches to hash references is the most practical and efficient? The HoA will look something like this:
%HoA = (
    'C1' => ['1', '3', '3', '3'],
    'C2' => ['3', '2'],
    'C3' => ['1', '3', '3', '4', '5', '5'],
    'C4' => ['3', '3', '4'],
    'C5' => ['1'],
);
OPTION 1
Generate the HoA as I parse the file (see below), and finally put the hash of arrays into a hash ref:
my $hash_ref = \%HoA;
OPTION 2
Parse the file such that each key in the HoA gets a value pointing to an array ref as it is built, and finally put the hash of arrays into a hash ref.
I feel like OPTION 2 is a good approach but how do I do this?
Here's how I'm currently doing it.
use strict; use warnings;
open(F1, "file.txt") or die $!;
my %HoA = ();
while (<F1>) {
    $_ =~ s/\r//;
    chomp;
    my @cols = split(/\t/, $_);
    push( @{ $HoA{$cols[0]} }, @cols[1..$#cols] );
}
close F1;
I need an efficient data structure that will let me look up values and keys quickly. I also need to be able to pass the values (the arrays), the keys, and the HoA itself into subroutines, as efficiently as possible, many times over.
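For context, this is roughly the call pattern I mean; the subroutine name is just a placeholder:
# process_key reads its arguments straight from @_:
# $_[0] is the key, $_[1] is the array ref, $_[2] is the hash ref.
sub process_key {
    my $count = scalar @{ $_[1] };
    return "$_[0] has $count values";
}
for my $key (keys %HoA) {
    print process_key($key, $HoA{$key}, \%HoA), "\n";
}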
Upvotes: 3
Views: 141
Reputation: 385915
You created %HoA and never used it.
You created $HoA_ref and never used it.
You used $HoA without declaring it. Always use use strict; use warnings;.
The = () in my %HoA = (); is silly.
The s/// and the chomp can be merged into one substitution.
Don't mention $_ when not needed, or use a meaningful variable name.
All of the above and a few other improvements have been made to get:
use strict;
use warnings;
open(my $fh, '<', 'file.txt') or die $!;
my %HoA;
while (<$fh>) {
    s/\r?\n\z//;
    my ($key, @cols) = split /\t/;
    push @{ $HoA{$key} }, @cols;
}
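Note that push @{ $HoA{$key} }, @cols; already autovivifies an anonymous array for each key, so the values are array references either way; OPTION 1 and OPTION 2 end up producing the same structure. Taking a reference afterwards costs a single operation:
my $hash_ref = \%HoA;             # OPTION 1, essentially free
my @c1 = @{ $hash_ref->{C1} };    # all values for key C1
my $first = $hash_ref->{C1}[0];   # a single element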
Upvotes: 4
Reputation:
Since you have a large file, rather than reading it with a while loop I would suggest slurping it in completely using the module File::Slurp.
File::Slurp's read_file function tries to bypass Perl's buffered I/O by using a sysread call (check the read_file source code).
my $text = read_file( $file ) ;
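A sketch of how the slurped data could feed the same HoA construction, assuming the tab-separated format from the question; in list context read_file returns the file's lines:
use File::Slurp;
my @lines = read_file('file.txt');    # slurp once, get a list of lines
my %HoA;
for my $line (@lines) {
    $line =~ s/\r?\n\z//;             # strip LF or CR/LF
    my ($key, @cols) = split /\t/, $line;
    push @{ $HoA{$key} }, @cols;
}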
Upvotes: 1
Reputation: 46960
My experience is that it's best to use references wherever possible. Some additional notes:
If you need $_ =~ s/\r//;
for Windows EOL compatibility, then you need a better Perl build; ActiveState's is usually the most robust. chomp
ought to take care of a terminal CR/LF, or rather the file read ought to have already converted the CR/LF pair to a lone LF.
Perl's shift is O(1) and very fast. You can use this to your advantage here.
You can't tell in advance what will be fastest; benchmarking the options is the only way to go.
Try reading the input file alone, with no processing. Once the job is I/O bound, further optimization no longer helps.
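For example, the core Benchmark module can compare candidate parsing styles directly; here the two subs stand in for whatever options you're weighing:
use Benchmark qw(cmpthese);
my $line = "C3\t1\t3\t3\t4\t5\t5";
# Run each candidate for at least 3 CPU seconds and print a comparison.
cmpthese(-3, {
    shift_key => sub { my @cols = split /\t/, $line; my $key = shift @cols; },
    list_key  => sub { my ($key, @cols) = split /\t/, $line; },
});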
Here is what I'd start with:
open(F, "file.txt") or die $!;
my $h = {};
while (<F>) {
    chomp;
    my @cols = split "\t";
    my $key = shift @cols;
    push @{ $h->{$key} }, @cols;
}
close F;
Upvotes: 2
Reputation: 4088
I think this is what you're attempting to do in your example.
open(my $fh, "<", "file.txt") or die $!;
my $HoA_ref = {};    # {} gives us an empty anonymous hash reference
while (my $line = <$fh>) {
    $line =~ s/\r//;
    chomp $line;
    my @cols = split(/\t/, $line);
    # shift off the first element in the list to use as the key
    my $key = shift(@cols);
    # set the value to an array ref of whatever is left in the list
    $HoA_ref->{$key} = [@cols];
}
close $fh;
It's worth noting that the value for $key
will get overwritten if the same key appears more than once as you loop through the file.
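If repeated keys should accumulate their values instead, push onto the array reference the way the question's own loop does:
# Appends to the key's array instead of replacing it:
push @{ $HoA_ref->{$key} }, @cols;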
Upvotes: 1