Comparing all elements from two hashes more efficiently

Question

I have two hashes in perl that each consist of about 250,000 elements. I have to compare each element from both hashes with each other, and perform another operation on the elements that are equal to each other. I have the following code, which makes ~60 billion comparisons, so takes way too long to finish:

foreach $key1 (keys %large_hash_1)
    {
    foreach $key2 (keys %large_hash_2)
        {
        if($some_other_var{$key1} == $some_other_var{$key2}) # so actually I compare another hash variable, using the keys from %large_hash_1 and %large_hash_2
             {
             # I print some stuff here to an output file using the $key1 and $key2 variables
             }
        }
    }

Is there a way to do this more rapidly?

mob · Accepted Answer

Probably. It looks like you can reformulate the problem as

Find all pairs of keys K1 and K2 such that:

$some_other_hash{K1} == $some_other_hash{K2}

K1 exists in %hash1 and K2 exists in %hash2

so let's try an approach where you find solutions to the first condition first and then see if they satisfy the second condition. Iterating through all pairs of keys is O(n²) but we already have a strategy for finding keys that map to the same hash value quickly: use another hash!

Let's build a "reverse hash" of %some_other_hash so that $hash7{VAL} produces a list of all keys in %some_other_hash such that $some_other_hash{KEY} == VAL:

push @{$hash7{ $some_other_hash{$_} }, $_ for keys %some_other_hash;

That was an O(n) operation. Next we need to find values that map to more than one key.

foreach my $v (keys %hash7) {
    @k = @{$hash7{$v}};
    next if @k < 2;
    ...
}

If you find such a value, check if some of the keys are in %hash1 and if some are in %hash2.

foreach my $v (keys %hash7) {
    @k = @{$hash7{$v}};
    next if @k < 2;
    @k1 = grep { exists $hash1{$_} } @k;
    @k2 = grep { exists $hash2{$_} } @k;
    if (@k1 && @k2) {
        foreach my $k1 (@k1) {
            foreach my $k2 (@k2) {
                print "$k1 from %hash1 and $k2 from %hash2 ",
                      "have the same value $v in %some_other_hash
";
                ...
            }
        }
    }
}

Worst case, where it is common to find values in %some_other_hash that are mapped by more than one key, this loop is O(mn). Depending on your data, this search could be significantly faster than iterating through all pairs of keys in %hash1 and %hash2.

Comparing all elements from two hashes more efficiently

Answers (1)

Related Questions