user1495523

Reputation: 505

Perl merge hash of array

I have a hash of arrays of numbers, and I would like to merge the hash elements that have numbers in common.

e.g. Input Hash

%HoA = (
    A1 => [ "1", "2", "3", "4" ],
    A2 => [ "5", "6", "7", "8" ],
    A3 => [ "1", "9", "10", "11" ],
    A4 => [ "12", "13", "14", "10" ],
);

Output Hash

%HoA_output = (
    A1 => [ "1", "2", "3", "4", "9", "10", "11", "12", "13", "14" ],
    A2 => [ "5", "6", "7", "8" ],
);

I need a solution that can quickly process a hash with nearly 50k keys and 8 numbers in each array.

regards

Upvotes: 0

Views: 806

Answers (2)

Borodin

Reputation: 126722

This is essentially a graph problem, where you want to determine the graph's connected components.

This solution uses the Graph::Undirected::Components module, whose sole purpose is to do exactly that. Hopefully it will be fast enough for your extended data set, but it is far easier for you to determine that than for me.

The program creates a graph and adds edges (connections) from every key in the data to each element of its value. Calling connected_components then returns all the distinct sets of nodes, both keys and values, that are connected to one another.

The final for loop separates the keys from the values using part from List::MoreUtils, based on whether the node appears as a key in the original hash data. (You will have to adjust this if any of the keys can also appear among the array values.) Then the lexically smallest of the keys, together with the sorted value items, is used to create a new element in the %result hash.

use strict;
use warnings;

use Graph::Undirected::Components;
use List::Util 'minstr';
use List::MoreUtils 'part';

my %data = (
  A1 => [  1,  2,  3,  4 ],
  A2 => [  5,  6,  7,  8 ],
  A3 => [  1,  9, 10, 11 ],
  A4 => [ 12, 13, 14, 10 ],
);

my $graph = Graph::Undirected::Components->new;

# add an edge from each key to every element of its array
while ( my ($k, $v) = each %data ) {
  $graph->add_edge($k, $_) for @$v;
}

my %result;
for my $component ( $graph->connected_components ) {
  # partition 0 holds original hash keys, partition 1 holds value items
  my @keys_vals = part { $data{$_} ? 0 : 1 } @$component;
  my $key       = minstr @{ $keys_vals[0] };    # lexically smallest key
  my @values    = sort { $a <=> $b } @{ $keys_vals[1] };
  $result{$key} = \@values;
}

use Data::Dump;
dd \%result;

output

{ A1 => [1 .. 4, 9 .. 14], A2 => [5 .. 8] }
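If you want to gauge the speed on something the size of your real data, a rough timing sketch like this one, using invented random data of roughly the shape you describe (50k keys, 8 numbers each), should tell you quickly:

use strict;
use warnings;

use Graph::Undirected::Components;
use Time::HiRes 'time';

# invented data of roughly the shape in the question:
# 50k keys with 8 random numbers each
my %data = map {
  ( "A$_" => [ map { 1 + int rand 400_000 } 1 .. 8 ] )
} 1 .. 50_000;

my $t0 = time;

my $graph = Graph::Undirected::Components->new;
while ( my ($k, $v) = each %data ) {
  $graph->add_edge($k, $_) for @$v;
}

my $t1 = time;

my @components = $graph->connected_components;
my $t2 = time;

printf "add_edge: %.2fs, connected_components: %.2fs, %d components\n",
  $t1 - $t0, $t2 - $t1, scalar @components;

Only the shape of the data is taken from the question; the 400_000 value range is an arbitrary choice that controls how much the components overlap.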

Upvotes: 4

Sobrique

Reputation: 53478

OK, fundamentally this isn't an easy one, because you do need to check each element against the others to see whether it is already present somewhere. The best I can come up with is to save some effort by merging lists as you go, using an index to track duplicates.

I would approach it like this:

use strict;
use warnings;
use Data::Dumper;


my %HoA = (
    A1 => [ "1",  "2",  "3",  "4" ],
    A2 => [ "5",  "6",  "7",  "8" ],
    A3 => [ "1",  "9",  "10", "11" ],
    A4 => [ "12", "13", "14", "10" ],
);

print Dumper \%HoA;

my %found;

sub merge_and_delete {
    my ( $first_key, $second_key ) = @_;
    print "Merging $first_key with $second_key\n";

    #use a hash to remove dupes.
    my %elements;
    $elements{$_}++ for @{ $HoA{$first_key} }, @{ $HoA{$second_key} };

    #numeric sort, since the elements are all numbers and the desired
    #output is in numeric order.
    $HoA{$first_key} = [ sort { $a <=> $b } keys %elements ];
    delete $HoA{$second_key};

    #update the index - we don't want it pointing at an array we've deleted.
    #if an element is already indexed under a third, still-live list, that
    #list overlaps this one too, so merge it in as well.
    foreach my $element ( keys %elements ) {
        if (    $found{$element}
            and $found{$element} ne $first_key
            and $found{$element} ne $second_key
            and $HoA{ $found{$element} } )
        {
            merge_and_delete( $first_key, $found{$element} );
        }
        $found{$element} = $first_key;
    }
}

foreach my $key ( sort keys %HoA ) {
    print "$key\n";
    foreach my $element ( sort @{ $HoA{$key} } ) {
        if ( $found{$element} and $found{$element} ne $key ) {
            #this element is present in another list, so we merge.
            print "$element found in $found{$element}\n";
            merge_and_delete( $found{$element}, $key );
            last;
        }
        else {
            #add this unique match to our index
            print "$element -> $key\n";
            $found{$element} = $key;
        }
    }
}

print Dumper \%HoA;

You iterate over each of the elements in %HoA and build an index table, %found. You use this index to detect whether an element has already been seen in another list, which triggers a merge, after which the index is rebuilt. You may need to watch memory consumption on a large data set, though, because the index can grow to be nearly as large as your original data set (if enough unique elements are present).

But because we stop processing on the first match, we no longer need to check every key, and because we discard the merged array and update the index, we no longer need an all-to-all comparison.
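If the index bookkeeping above feels fiddly, the same no-all-to-all idea can be expressed with a classic union-find (disjoint set) structure, which never rebuilds merged lists at all. This is a sketch of that alternative, not a tweak of the code above; the K:/V: prefixes are my own convention to keep hash keys and array elements from colliding in the one parent table:

use strict;
use warnings;
use Data::Dumper;

my %HoA = (
    A1 => [ "1",  "2",  "3",  "4" ],
    A2 => [ "5",  "6",  "7",  "8" ],
    A3 => [ "1",  "9",  "10", "11" ],
    A4 => [ "12", "13", "14", "10" ],
);

#union-find with path compression: %parent maps each node to its leader
my %parent;

sub find {
    my ($x) = @_;
    $parent{$x} //= $x;
    $parent{$x} = find( $parent{$x} ) if $parent{$x} ne $x;
    return $parent{$x};
}

sub union {
    my ( $rx, $ry ) = ( find( $_[0] ), find( $_[1] ) );
    $parent{$ry} = $rx if $rx ne $ry;
}

#connect each key to every element of its array
while ( my ( $key, $vals ) = each %HoA ) {
    union( "K:$key", "V:$_" ) for @$vals;
}

#collect the original keys of each component, in sorted order
my %groups;
push @{ $groups{ find("K:$_") } }, $_ for sort keys %HoA;

#keep the first key of each group; merge, dedupe and sort its values
my %result;
for my $members ( values %groups ) {
    my %seen;
    $result{ $members->[0] } = [
        sort { $a <=> $b }
        grep { !$seen{$_}++ }
        map  { @{ $HoA{$_} } } @$members
    ];
}

print Dumper \%result;

Path compression in find keeps the chains short, so in practice the whole pass stays close to linear in the number of elements.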

Upvotes: 3
