Reputation: 6116
I have two hashes in perl that each consist of about 250,000 elements. I have to compare each element from both hashes with each other, and perform another operation on the elements that are equal to each other. I have the following code, which makes ~60 billion comparisons, so takes way too long to finish:
foreach $key1 (keys %large_hash_1)
{
foreach $key2 (keys %large_hash_2)
{
if($some_other_var{$key1} == $some_other_var{$key2}) # so actually I compare another hash variable, using the keys from %large_hash_1 and %large_hash_2
{
# I print some stuff here to an output file using the $key1 and $key2 variables
}
}
}
Is there a way to do this more rapidly?
Upvotes: 1
Views: 179
Reputation: 118695
Probably. It looks like you can reformulate the problem as
Find all pairs of keys
K1
andK2
such that:
$some_other_hash{K1} == $some_other_hash{K2}
K1
exists in%hash1
andK2
exists in%hash2
so let's try an approach where you find solutions to the first condition first and then see if they satisfy the second condition. Iterating through all pairs of keys is O(n2) but we already have a strategy for finding keys that map to the same hash value quickly: use another hash!
Let's build a "reverse hash" of %some_other_hash
so that $hash7{VAL}
produces a list of all keys in %some_other_hash
such that $some_other_hash{KEY} == VAL
:
push @{$hash7{ $some_other_hash{$_} }, $_ for keys %some_other_hash;
That was an O(n) operation. Next we need to find values that map to more than one key.
foreach my $v (keys %hash7) {
@k = @{$hash7{$v}};
next if @k < 2;
...
}
If you find such a value, check if some of the keys are in %hash1
and if some are in %hash2
.
foreach my $v (keys %hash7) {
@k = @{$hash7{$v}};
next if @k < 2;
@k1 = grep { exists $hash1{$_} } @k;
@k2 = grep { exists $hash2{$_} } @k;
if (@k1 && @k2) {
foreach my $k1 (@k1) {
foreach my $k2 (@k2) {
print "$k1 from %hash1 and $k2 from %hash2 ",
"have the same value $v in %some_other_hash\n";
...
}
}
}
}
Worst case, where it is common to find values in %some_other_hash
that are mapped by more than one key, this loop is O(mn). Depending on your data, this search could be significantly faster than iterating through all pairs of keys in %hash1
and %hash2
.
Upvotes: 6