user1039417
user1039417

Reputation: 109

Improving how to count duplicate elements in a hash of arrarys

My task is to loop through a list (of upto 50K) of hostnames and associated IP and MAC addresses, looking for duplicates, in an attempt to do bit of a tidy up, and I've come up with this working solution:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;


my %HoA = (
    host1   => [ "10.1", "ae:ab" ],
    host2   => [ "10.2", "aa:ee" ],
    host3   => [ "10.3", "aa:ee" ],
    host4   => [ "10.1", "ab:ab" ],
);

my %duplicate =();

foreach my $key ( keys %HoA ) {
  push @{ $duplicate { $HoA{$key}[0] } } , "$key", "$HoA{$key}[1]" ;
  push @{ $duplicate { $HoA{$key}[1] } } , "$key", "$HoA{$key}[0]" ;
}

foreach my $key2 ( keys %duplicate ) {
    if ( (scalar @{ $duplicate{$key2} } ) > 2  ) {
        print "Duplicate:$key2\tGroup:@{ $duplicate{$key2} }\n";
    }
}


print Dumper (\%duplicate) . "\n";

I couldn't find any examples for finding duplicate in a hash of arrays so came up with the example above, which works pretty well on the four entries I've listed.

So I was wondering if there are any better ways of doing this, and how well my code will scale to large numbers?

Any insights are gladly welcome.

Cheers,

Andy

UPDATE: I eventually went for this solution, (after a couple of weeks of playing around) and added an extra anonymous array comparison.

Thanks for all the comments, it really helps bouncing ideas around:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my (%dup) = ();
my ( $data_a, $data_b ) = ();
my ( @a,      @b )      = ();

@a = (
    [qw/ host1 10.1 ae:ab /], [qw/ host2 10.2 aa:ee /],
    [qw/ host3 10.3 aa:ee /], [qw/ host4 10.1 ab:ab /],
);

@b = (
    [qw/ host1 10.1 ae:ab /], [qw/ host3 10.2 aa:ee /],
    [qw/ host6 10.3 aa:ee /], [qw/ host4 10.1 ab:ab /],
);

foreach $data_a (@a) {
    my ( $host, $ip, $mac ) = @$data_a;
    push @{ $dup{$host} }, "$host $ip $mac";
    push @{ $dup{$ip} },   "$host $ip $mac";
    push @{ $dup{$mac} },  "$host $ip $mac";
}

foreach $data_b (@b) {
    my ( $host, $ip, $mac ) = @$data_b;
    push @{ $dup{$host} }, "$host $ip $mac";
    push @{ $dup{$ip} },   "$host $ip $mac";
    push @{ $dup{$mac} },  "$host $ip $mac";
}

print Dumper (%dup) . "\n";
#skipped scalar search

Upvotes: 1

Views: 661

Answers (3)

Brad Gilbert
Brad Gilbert

Reputation: 34120

#! /usr/bin/perl
use strict;
use warnings;

my %hosts = (
  host1 => [ "10.1", "ae:ab" ],
  host2 => [ "10.2", "aa:ee" ],
  host3 => [ "10.3", "aa:ee" ],
  host4 => [ "10.1", "ab:ab" ],
);

my (%dup_mac,%dup_ip);

while( my($hostname,$addr) = each %hosts ) {
  push @{ $dup_ip{  $addr->[0] } }, $hostname;
  push @{ $dup_mac{ $addr->[1] } }, $hostname;
}

find_dup(\%dup_mac,'MAC');
find_dup(\%dup_ip,'IP');

sub find_dup{
  my($hash,$type) = @_;
  for my $addr ( sort keys %$hash ){
    my $hosts = $hash->{$addr};
    next unless @$hosts > 1;

    print "Duplicate $type: $addr\n";
    print ' 'x4, $_, "\n" for sort @$hosts;
  }
}
Duplicate MAC: aa:ee
    host2
    host3
Duplicate IP: 10.1
    host1
    host4

Upvotes: 0

David W.
David W.

Reputation: 107040

You've pretty much got it. The best way of detecting duplicates is to create a hash. You're only looping through your structure twice which is fairly efficient. Even a million records would take less than a second to execute on a modern computer. Main issues with scaling happens when you do loops of loops.

For example, what if you decided to compare each key with each other key:

foreach my $key ( keys %HoA ) {
    foreach my $key2 (keys %HoA) {
       #Some sort of comparison between $HoA{$key} and $HoA{$key2}
    }
}

That would loop through the comparison the square of the number of entries in %HoA. By comparison, your algorithm only loops through twice the number of keys (once for each loop). Your algorithm could probably do 1,000,000 entries in less than a second. The loop of a loop could take over a day.

My only comments concern readability:

  • Why did you use $key and $key2? Took me a few microseconds to realize you didn't need both $key and $key2.
  • I would use two separate hashes instead of an array with the two hashes, and I would assign the IP address and MAC address to two temporary variables. It simplifies your syntax and makes it much easier to read.

For example:

my %ip_hash;
my %mac_hash;
foreach my $key ( keys %HoA ) {
    my $ip = $HoA{$key}[0];
    my $mac = $HoA{$key}[1];
    push @{ $ip_hash{$ip} }, $key, $mac;
    push @{ $mac_hash{$mac} }, $key, $ip;
}

I originally missed the fact you were putting the MAC address into the IP hash, and the IP address into the MAC hash. Here it's very clear.

Upvotes: 1

Borodin
Borodin

Reputation: 126722

Simply discovering duplicates could be made much more concise, but if you have a requirement to display all members of groups with multiple entries then you are close to the best that can be done. I would say, however, that %HoA is a bad name for your hash, because it should describe the contents of the hash rather than its structure. I would also prefer to see the hash element values pulled out like this

foreach my $key ( keys %HoA ) {
  my $val = $HoA{$key};
  push @{ $duplicate { $val->[0] } } , "$key", "$val->[1]" ;
  push @{ $duplicate { $val->[1] } } , "$key", "$val->[0]" ;
}

Finally, your %HoA is really just a set of records with three values each, and could just as easily be contained in an array of anonymous array. This code is equivalent to your original, and I think more readable

my @data = (
  [qw/ host1 10.1 ae:ab / ],
  [qw/ host2 10.2 aa:ee / ],
  [qw/ host3 10.3 aa:ee / ],
  [qw/ host4 10.1 ab:ab / ],
);

my %duplicate = ();

foreach my $rec ( @data) {
  my ($host, $val1, $val2) = @$rec;
  push @{$duplicate {$val1}} , "$host", "$val2" ;
  push @{$duplicate {$val2}} , "$host", "$val1" ;
}

Upvotes: 1

Related Questions