Reputation: 109
My task is to loop through a list (of up to 50K) of hostnames and associated IP and MAC addresses, looking for duplicates, in an attempt to do a bit of a tidy-up, and I've come up with this working solution:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %HoA = (
    host1 => [ "10.1", "ae:ab" ],
    host2 => [ "10.2", "aa:ee" ],
    host3 => [ "10.3", "aa:ee" ],
    host4 => [ "10.1", "ab:ab" ],
);
my %duplicate = ();
foreach my $key ( keys %HoA ) {
    push @{ $duplicate{ $HoA{$key}[0] } }, "$key", "$HoA{$key}[1]";
    push @{ $duplicate{ $HoA{$key}[1] } }, "$key", "$HoA{$key}[0]";
}
foreach my $key2 ( keys %duplicate ) {
    if ( ( scalar @{ $duplicate{$key2} } ) > 2 ) {
        print "Duplicate:$key2\tGroup:@{ $duplicate{$key2} }\n";
    }
}
print Dumper(\%duplicate) . "\n";
I couldn't find any examples of finding duplicates in a hash of arrays, so I came up with the example above, which works pretty well on the four entries I've listed.
So I was wondering if there are any better ways of doing this, and how well will my code scale to large numbers?
Any insights are gladly welcome.
Cheers,
Andy
UPDATE: I eventually went for this solution (after a couple of weeks of playing around), adding an extra anonymous-array comparison.
Thanks for all the comments; it really helps to bounce ideas around:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %dup;
my @a = (
    [ qw/ host1 10.1 ae:ab / ], [ qw/ host2 10.2 aa:ee / ],
    [ qw/ host3 10.3 aa:ee / ], [ qw/ host4 10.1 ab:ab / ],
);
my @b = (
    [ qw/ host1 10.1 ae:ab / ], [ qw/ host3 10.2 aa:ee / ],
    [ qw/ host6 10.3 aa:ee / ], [ qw/ host4 10.1 ab:ab / ],
);
# index every record in the first list by hostname, IP and MAC
foreach my $data_a (@a) {
    my ( $host, $ip, $mac ) = @$data_a;
    push @{ $dup{$host} }, "$host $ip $mac";
    push @{ $dup{$ip} },   "$host $ip $mac";
    push @{ $dup{$mac} },  "$host $ip $mac";
}
# do the same for the second list so collisions across both lists show up
foreach my $data_b (@b) {
    my ( $host, $ip, $mac ) = @$data_b;
    push @{ $dup{$host} }, "$host $ip $mac";
    push @{ $dup{$ip} },   "$host $ip $mac";
    push @{ $dup{$mac} },  "$host $ip $mac";
}
print Dumper(\%dup) . "\n";
#skipped scalar search
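The scalar search that's skipped above could be something along these lines (just a sketch, assuming a key counts as a duplicate once it has collected more than one "host ip mac" record):
# report any hostname, IP or MAC that turned up in more than one record
foreach my $key ( sort keys %dup ) {
    my @records = @{ $dup{$key} };
    next unless @records > 1;
    print "Duplicate: $key\n";
    print "    $_\n" for @records;
}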
Upvotes: 1
Views: 661
Reputation: 34120
#! /usr/bin/perl
use strict;
use warnings;
my %hosts = (
    host1 => [ "10.1", "ae:ab" ],
    host2 => [ "10.2", "aa:ee" ],
    host3 => [ "10.3", "aa:ee" ],
    host4 => [ "10.1", "ab:ab" ],
);
my ( %dup_mac, %dup_ip );
while ( my ( $hostname, $addr ) = each %hosts ) {
    push @{ $dup_ip{ $addr->[0] } },  $hostname;
    push @{ $dup_mac{ $addr->[1] } }, $hostname;
}
find_dup( \%dup_mac, 'MAC' );
find_dup( \%dup_ip,  'IP' );

sub find_dup {
    my ( $hash, $type ) = @_;
    for my $addr ( sort keys %$hash ) {
        my $hosts = $hash->{$addr};
        next unless @$hosts > 1;
        print "Duplicate $type: $addr\n";
        print ' ' x 4, $_, "\n" for sort @$hosts;
    }
}
Duplicate MAC: aa:ee
    host2
    host3
Duplicate IP: 10.1
    host1
    host4
Upvotes: 0
Reputation: 107040
You've pretty much got it. The best way of detecting duplicates is to build a hash. You're only looping through your structure twice, which is fairly efficient; even a million records would take less than a second to process on a modern computer. The main scaling problems appear when you run loops inside loops.
For example, what if you decided to compare each key with each other key:
foreach my $key ( keys %HoA ) {
    foreach my $key2 ( keys %HoA ) {
        # Some sort of comparison between $HoA{$key} and $HoA{$key2}
    }
}
That would run the comparison a number of times equal to the square of the number of entries in %HoA. By contrast, your algorithm only passes over the data twice (once per loop). Your algorithm could probably handle 1,000,000 entries in under a second, while the nested loop could take over a day.
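If you want to sanity-check that on your own machine, here's a rough sketch using the core Benchmark module; the 50,000 synthetic entries are invented purely for the test:
use Benchmark qw(timethis);
# build a made-up %HoA with 50_000 hosts
my %HoA = map {
    ( "host$_" => [ "10." . int( $_ / 250 ), sprintf( "aa:%02x", $_ % 256 ) ] )
} 1 .. 50_000;
# time ten runs of the single-pass duplicate-gathering loop
timethis( 10, sub {
    my %duplicate;
    foreach my $key ( keys %HoA ) {
        push @{ $duplicate{ $HoA{$key}[0] } }, $key, $HoA{$key}[1];
        push @{ $duplicate{ $HoA{$key}[1] } }, $key, $HoA{$key}[0];
    }
} );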
My only comments concern readability: why $key and $key2? It took me a few microseconds to realize you didn't actually need both names. For example:
my %ip_hash;
my %mac_hash;
foreach my $key ( keys %HoA ) {
    my $ip  = $HoA{$key}[0];
    my $mac = $HoA{$key}[1];
    push @{ $ip_hash{$ip} },   $key, $mac;
    push @{ $mac_hash{$mac} }, $key, $ip;
}
I originally missed the fact you were putting the MAC address into the IP hash, and the IP address into the MAC hash. Here it's very clear.
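A reporting pass on top of those two hashes could then look something like this (a sketch; the > 2 test mirrors your original, since each host pushes two items onto its IP and MAC entries):
# an IP is shared when more than one host/value pair was pushed onto it
foreach my $ip ( sort keys %ip_hash ) {
    print "Duplicate IP: $ip\tGroup: @{ $ip_hash{$ip} }\n"
        if @{ $ip_hash{$ip} } > 2;
}
# same idea for MAC addresses
foreach my $mac ( sort keys %mac_hash ) {
    print "Duplicate MAC: $mac\tGroup: @{ $mac_hash{$mac} }\n"
        if @{ $mac_hash{$mac} } > 2;
}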
Upvotes: 1
Reputation: 126722
Simply discovering duplicates could be made much more concise, but if you have a requirement to display all members of groups with multiple entries then you are close to the best that can be done. I would say, however, that %HoA
is a bad name for your hash, because a name should describe the contents of the hash rather than its structure. I would also prefer to see the hash element values pulled out, like this:
foreach my $key ( keys %HoA ) {
    my $val = $HoA{$key};
    push @{ $duplicate{ $val->[0] } }, "$key", "$val->[1]";
    push @{ $duplicate{ $val->[1] } }, "$key", "$val->[0]";
}
Finally, your %HoA is really just a set of records with three values each, and could just as easily be held in an array of anonymous arrays. This code is equivalent to your original, and I think it is more readable:
my @data = (
    [ qw/ host1 10.1 ae:ab / ],
    [ qw/ host2 10.2 aa:ee / ],
    [ qw/ host3 10.3 aa:ee / ],
    [ qw/ host4 10.1 ab:ab / ],
);
my %duplicate = ();
foreach my $rec (@data) {
    my ( $host, $val1, $val2 ) = @$rec;
    push @{ $duplicate{$val1} }, "$host", "$val2";
    push @{ $duplicate{$val2} }, "$host", "$val1";
}
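The reporting loop from your original then works unchanged against this %duplicate:
foreach my $key ( keys %duplicate ) {
    if ( @{ $duplicate{$key} } > 2 ) {
        print "Duplicate:$key\tGroup:@{ $duplicate{$key} }\n";
    }
}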
Upvotes: 1