Reputation: 107
I have two arrays (Array1 and Array2). Array1 contains approximately 20,000 non-unique elements. Array2 contains approximately 90,000 unique elements. I am trying to count the number of elements in Array1 that also show up as elements of Array2. The Perl script below does this successfully but slowly. Is there another method for counting the number of Array1 elements that exist in Array2 that is likely to be faster than this?
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
# This program counts the total number of elements in one array that exist in a separate array.
my $Original_Array_File = "U:/Perl/MasterArray.txt";
# Read in the master file into an array.
open my $file, '<', $Original_Array_File or die $!;
my @Array2 = <$file>;
close $file;
my $path = "C:/Files by Year/1993";
chdir($path) or die "Can't chdir to $path: $!";
for my $new_file ( grep -f, glob('*.txt') ) {
    open my $new_fh, '<', $new_file;
    my @Array1 = <$new_fh>;
    my $matched_array_count = 0;
    foreach (@Array1) {
        ++$matched_array_count if ($_ ~~ @Array2);
    }
    print "$new_file: $matched_array_count\n";
    close $new_fh;
}
Upvotes: 1
Views: 168
Reputation: 24063
The primary source of your performance problem is:
foreach (@Array1) {
    ++$matched_array_count if ($_ ~~ @Array2);
}
According to the smartmatch documentation, the behavior of
$string ~~ @array
is like
grep { $string ~~ $_ } @array
and
$string1 ~~ $string2
is like
$string1 eq $string2
Putting these together with your original code snippet, we get something like:
foreach my $string1 (@Array1) {
    ++$matched_array_count if grep { $string1 eq $_ } @Array2;
}
In other words, take the first element in @Array1 and compare it to every single element in @Array2, then take the second element in @Array1 and compare it to every single element in @Array2, and so on. This works out to scalar(@Array1) * scalar(@Array2) comparisons, or roughly 1.8 billion total.
The way this is usually done in Perl is with a hash. It is significantly faster to do a single hash lookup than to search through every element of an array. The basic algorithm is:
1. Load the elements of one array into a hash as keys.
2. For each element of the other array, use exists to see if there's a matching key in the hash.
In your particular case, you would load the contents of U:/Perl/MasterArray.txt into a hash and then perform scalar(@Array1) hash lookups, or ~20,000 in total. That will be significantly faster than what you have now.
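As a minimal sketch of that algorithm in isolation (the helper name count_matches and the sample data are illustrative, not from the question):

```perl
use strict;
use warnings;

# Count how many elements of @$lines_ref appear in @$master_ref,
# using a hash built once from the master list.
sub count_matches {
    my ($master_ref, $lines_ref) = @_;
    my %in_master = map { $_ => 1 } @$master_ref;   # one-time hash build
    my $count = 0;
    for my $line (@$lines_ref) {
        ++$count if exists $in_master{$line};       # O(1) lookup per element
    }
    return $count;
}
```

In the question's script, the master list would come from U:/Perl/MasterArray.txt and the other array from each yearly file, with chomp applied to both so trailing newlines don't prevent matches.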
Here is an example of searching for words in a Linux dictionary file:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $file = '/usr/share/dict/linux.words';
open my $fh, '<', $file or die "Failed to open '$file': $!";
my %haystack = map { chomp; $_ => 1 } <$fh>;
my $count;
while (<DATA>) {
chomp;
$count++ if exists $haystack{$_};
}
say $count;
__DATA__
foo
bar
foobar
fubar
This prints:
3
(Apparently, "foobar" is a word. "FUBAR" is a word, but "fubar" in lowercase is not.)
You should also be aware that as of Perl 5.18.0, the smartmatch family of features are now experimental:
Smart match, added in v5.10.0 and significantly revised in v5.10.1, has been a regular point of complaint. Although there are a number of ways in which it is useful, it has also proven problematic and confusing for both users and implementors of Perl. There have been a number of proposals on how to best address the problem. It is clear that smartmatch is almost certainly either going to change or go away in the future. Relying on its current behavior is not recommended. (emphasis added)
Upvotes: 4
Reputation: 1659
It looks like you want to compare many files.
You might split the work across a pool of threads, with one thread per file. Or, if you are more memory-constrained, you could split one of the arrays into parts and send each part to a different thread.
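A minimal sketch of the one-thread-per-file idea using the core threads module (note that use threads requires a perl built with thread support, and interpreter threads are officially discouraged in current Perl documentation; the master data here is a placeholder):

```perl
use strict;
use warnings;
use threads;

# Each worker counts how many lines of its file appear in the master hash.
# threads copy data at creation time, so each thread gets its own copy.
sub count_in_file {
    my ($file, $master_ref) = @_;
    open my $fh, '<', $file or die "Can't open '$file': $!";
    my $count = 0;
    while (<$fh>) {
        chomp;
        ++$count if exists $master_ref->{$_};
    }
    close $fh;
    return $count;
}

my %in_master = map { $_ => 1 } qw(foo bar);   # placeholder master data
my @workers = map { threads->create(\&count_in_file, $_, \%in_master) }
              grep -f, glob('*.txt');
my $total = 0;
$total += $_->join for @workers;
print "total matches: $total\n";
```

Whether this actually helps depends on where the time goes: once the smartmatch loop is replaced with hash lookups, the job is largely I/O-bound, and threads may add more overhead than they save.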
Upvotes: 0