Rick
Rick

Reputation: 107

Most efficient process of matching array elements in Perl?

I have two arrays (Array1 and Array2). Array1 contains approximately 20,000 non-unique elements. Array2 contains approximately 90,000 unique elements. I am trying to count the number of elements in Array1 that also show up as elements of Array2. The perl script below does this successfully but slowly. Is there another method for counting the number of Array1 elements that exist in Array2 that is likely to be faster than this?

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# This program counts the total number of elements in one array that exist in a separate array.

my $Original_Array_File = "U:/Perl/MasterArray.txt";

# Read in the master file into an array.
open my $file, '<', $Original_Array_File or die $!;
my @Array2 = <$file>;
close $file;

my $path = "C:/Files by Year/1993";
chdir($path) or die "Cant chdir to $path $!";

for my $new_file ( grep -f, glob('*.txt') ) {
    open my ($new_fh), '<', $new_file;
    my @Array1 = <$new_file>;
    my @matched_array_count ;

    foreach @Array1 {
         ++$matched_array_count if ($_ ~~ @Array2 ) ;                            
         } 

Upvotes: 1

Views: 168

Answers (2)

ThisSuitIsBlackNot
ThisSuitIsBlackNot

Reputation: 24063

The primary source of your performance problem is:

foreach @Array1 {
    ++$matched_array_count if ($_ ~~ @Array2 );                            
} 

According to the smartmatch documentation, the behavior of

$string ~~ @array

is like

grep { $string ~~ $_ } @array

and

$string1 ~~ $string2

is like

$string1 eq $string2

Putting these together with your original code snippet, we get something like:

foreach my $string1 @Array1 {
    ++$matched_array_count if grep { $string1 eq $_ } @Array2;
}

In other words, take the first element in @Array1 and compare it to every single element in @Array2, then take the second element in @Array1 and compare it to every single element in @Array2, and so on. This works out to

scalar @Array1 * scalar @Array2

comparisons, or roughly 1.8 billion total.

The way this is usually done in Perl is with a hash. It is significantly faster to do a single hash lookup than to search through every element in an array. The basic algorithm is:

  1. Load your "haystack" (what you're searching in) into a hash.
  2. Loop through your "needles" (what you're searching for) and use exists to see if there's a matching key in the hash.

In your particular case, you would load the contents of U:/Perl/MasterArray.txt into a hash and then perform

scalar @Array1

hash lookups, or ~20,000. That will be significantly faster than what you have now.

Here is an example of searching for words in a Linux dictionary file:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

my $file = '/usr/share/dict/linux.words';
open my $fh, '<', $file or die "Failed to open '$file': $!";

my %haystack = map { chomp; $_ => 1 } <$fh>;

my $count;
while (<DATA>) {
    chomp;
    $count++ if exists $haystack{$_};
}

say $count;

__DATA__
foo
bar
foobar
fubar

Output:

3

(Apparently, "foobar" is a word. "FUBAR" is a word, but "fubar" in lowercase is not.)

Smartmatch is experimental

You should also be aware that as of Perl 5.18.0, the smartmatch family of features are now experimental:

Smart match, added in v5.10.0 and significantly revised in v5.10.1, has been a regular point of complaint. Although there are a number of ways in which it is useful, it has also proven problematic and confusing for both users and implementors of Perl. There have been a number of proposals on how to best address the problem. It is clear that smartmatch is almost certainly either going to change or go away in the future. Relying on its current behavior is not recommended. (emphasis added)

Upvotes: 4

cs_alumnus
cs_alumnus

Reputation: 1659

It looks like you want to compare many files.

You might split this up to use a pool of threads with one thread per file. Or if you are more memory constrained you might split one of the arrays into parts and have each part sent to a different thread.

Upvotes: 0

Related Questions