Reed Debaets
Reed Debaets

Reputation: 522

find extra, missing, invalid strings when comparing two lists in perl

List-1    List-2
one       one
two       three
three     three
four      four
five      six
six       seven
eight     eighttt
nine      nine

Looking to output

one       | one        PASS
two       | *               FAIL MISSING
three     | three      PASS
*         | three           FAIL EXTRA
four      | four       PASS
five      | *               FAIL MISSING
six       | six        PASS
*         | seven           FAIL EXTRA
eight     | eighttt         FAIL INVALID
nine      | nine       PASS

Actually the return from my current solution is a reference to the two modified lists and a reference to a "fail" list describing the failure for the index as either "no fail", "missing", "extra", or "invalid" which is also (obviously) fine output.

My current solution is:

sub compare {
    local $thisfound = shift;
    local $thatfound = shift;
    local @thisorig = @{ $thisfound };
    local @thatorig = @{ $thatfound };
    local $best = 9999; 

    foreach $n (1..6) {
        local $diff = 0;
        local @thisfound = @thisorig;
        local @thatfound = @thatorig;
        local @fail = ();
        for (local $i=0;$i<scalar(@thisfound) || $i<scalar(@thatfound);$i++) {
            if($thisfound[$i] eq $thatfound[$i]) { 
                $fail[$i] = 'NO_FAIL';
                next;
            }
            if($n == 1) {      # 1 2 3
                next unless __compare_missing__();
                next unless __compare_extra__();
                next unless __compare_invalid__();
            } elsif($n == 2) { # 1 3 2
                next unless __compare_missing__();
                next unless __compare_invalid__();
                next unless __compare_extra__();
            } elsif($n == 3) { # 2 1 3
                next unless __compare_extra__();
                next unless __compare_missing__();
                next unless __compare_invalid__();
            } elsif($n == 4) { # 2 3 1
                next unless __compare_extra__();
                next unless __compare_invalid__();
                next unless __compare_missing__();
            } elsif($n == 5) { # 3 1 2
                next unless __compare_invalid__();
                next unless __compare_missing__();
                next unless __compare_extra__();
            } elsif($n == 6) { # 3 2 1
                next unless __compare_invalid__();
                next unless __compare_extra__();
                next unless __compare_missing__();
            }
            push @fail,'INVALID'; 
            $diff += 1;
        }
        if ($diff<$best) {
            $best = $diff;
            @thisbest = @thisfound;
            @thatbest = @thatfound;
            @failbest = @fail;
        }
    }
    return (\@thisbest,\@thatbest,\@failbest)
}

sub __compare_missing__ {
    my $j;
    ### Does that command match a later this command? ###
    ### If so most likely a MISSING command           ###
    for($j=$i+1;$j<scalar(@thisfound);$j++) {
        if($thisfound[$j] eq $thatfound[$i]) {
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'MISSING'); }
            @end = @thatfound[$i..$#thatfound];
            @thatfound = @thatfound[0..$i-1];
            for ($i..$j-1) { push(@thatfound,'*'); }
            push(@thatfound,@end);
            $i=$j-1;
            last;
        }
    }
    $j == scalar(@thisfound);
}

sub __compare_extra__ {
    my $j;
    ### Does this command match a later that command? ###
    ### If so, most likely an EXTRA command           ###
    for($j=$i+1;$j<scalar(@thatfound);$j++) {
        if($thatfound[$j] eq $thisfound[$i]) { 
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'EXTRA'); }
            @end = @thisfound[$i..$#thisfound];
            @thisfound = @thisfound[0..$i-1];
            for ($i..$j-1) { push (@thisfound,'*'); }
            push(@thisfound,@end);
            $i=$j-1;
            last; 
        }
    }
    $j == scalar(@thatfound);
}

sub __compare_invalid__ {
    my $j;
    ### Do later commands match?                      ###
    ### If so most likely an INVALID command          ###
    for($j=$i+1;$j<scalar(@thisfound);$j++) {
        if($thisfound[$j] eq $thatfound[$j]) { 
            $diff += $j-$i;
            for ($i..$j-1) { push(@fail,'INVALID'); }
            $i=$j-1;
            last;
        }
    }
    $j == scalar(@thisfound);
}

But this isn't perfect ... who wants to simplify and improve? Specifically ... within a single data set, one order of searching is better for a subset and another order is better for a different subset.

Upvotes: 0

Views: 1566

Answers (5)

brian d foy
brian d foy

Reputation: 132896

The trick in Perl (and similar languages) is the hash, which doesn't care about order.

Suppose that the first array is the one that hold the valid elements. Construct a hash with those values as keys:

  my @valid = qw( one two ... );
  my %valid = map { $_, 1 } @valid;

Now, to find the invalid elements, you just have to find the ones not in the %valid hash:

  my @invalid = grep { ! exists $valid{$_} } @array;

If you want to know the array indices of the invalid elements:

  my @invalid_indices = grep { ! exists $valid{$_} } 0 .. $#array;

Now, you can expand that to find the repeated elements too. Not only do you check the %valid hash, but also keep track of what you have already seen:

 my %Seen;
 my @invalid_indices = grep { ! exists $valid{$_} && ! $Seen{$_}++ } 0 .. $#array;

The repeated valid elements are the ones with a value in %Seen that is greater than 1:

 my @repeated_valid = grep { $Seen{$_} > 1 } @valid;

To find the missing elements, you look in %Seen to check what isn't in there.

 my @missing = grep { ! $Seen{$_ } } @valid;

Upvotes: 0

Reed Debaets
Reed Debaets

Reputation: 522

sub compare {
    local @d = ();

    my $this = shift;
    my $that = shift;
    my $distance = _levenshteindistance($this, $that);

    my @thisorig = @{ $this };
    my @thatorig = @{ $that };

    my $s = $#thisorig;
    my $t = $#thatorig;

    @this = ();
    @that = ();
    @fail = ();

    while($s>0 || $t>0) {
        #                  deletion,    insertion,   substitution
        my $min = _minimum($d[$s-1][$t],$d[$s][$t-1],$d[$s-1][$t-1]);
        if($min == $d[$s-1][$t-1]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,$thatorig[$t]);
            if($d[$s][$t] > $d[$s-1][$t-1]) {
                unshift(@fail,'INVALID');
            } else {
                unshift(@fail,'NO_FAIL');
            }
            $s -= 1;
            $t -= 1;
        } elsif($min == $d[$s][$t-1]) {
            unshift(@this,'*');
            unshift(@that,$thatorig[$t]);
            unshift(@fail,'EXTRA');
            $t -= 1;
        } elsif($min == $d[$s-1][$t]) {
            unshift(@this,$thisorig[$s]);
            unshift(@that,'*');
            unshift(@fail,'MISSING');
            $s -= 1;
        } else {
            die("Error! $!");
        }
    }

    return(\@this, \@that, \@fail);

}

sub _minimum {
    my $ret = 2**53;
    foreach $in (@_) {
        $ret = $ret < $in ? $ret : $in;
    }
    $ret;
}

sub _levenshteindistance {
    my $s = shift;
    my $t = shift;
    my @s = @{ $s };
    my @t = @{ $t };

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i] = ();
    }

    for(my $i=0;$i<scalar(@s);$i++) {
        $d[$i][0] = $i # deletion
    }
    for(my $j=0;$j<scalar(@t);$j++) {
        $d[0][$j] = $j # insertion
    }

    for(my $j=1;$j<scalar(@t);$j++) {
        for(my $i=1;$i<scalar(@s);$i++) {
            if ($s[$i] eq $t[$j]) {
                $d[$i][$j] = $d[$i-1][$j-1];
            } else {
                #                    deletion,      insertion,     substitution
                $d[$i][$j] = _minimum($d[$i-1][$j]+1,$d[$i][$j-1]+1,$d[$i-1][$j-1]+1);
            }
        }
    }

    foreach $a (@d) {
        @a = @{ $a };
        foreach $b (@a) {
            printf STDERR "%2d ",$b;
        }
        print STDERR "\n";
    }

    return $d[$#s][$#t];
}

Upvotes: 0

reinierpost
reinierpost

Reputation: 8611

If the arrays contain duplicate values, the answer is quite a bit more complicated than that.

See e.g. Algorithm::Diff or read about Levenshtein distance.

Upvotes: 4

brian d foy
brian d foy

Reputation: 132896

From perlfaq4's answer to How do I compute the difference of two arrays? How do I compute the intersection of two arrays?:


Use a hash. Here's code to do both and more. It assumes that each element is unique in a given array:

@union = @intersection = @difference = ();
%count = ();
foreach $element (@array1, @array2) { $count{$element}++ }
foreach $element (keys %count) {
    push @union, $element;
    push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
    }

Note that this is the symmetric difference, that is, all elements in either A or in B but not in both. Think of it as an xor operation.

Upvotes: -1

brian d foy
brian d foy

Reputation: 132896

From perlfaq4's answer to How can I tell whether a certain element is contained in a list or array?:


(portions of this answer contributed by Anno Siegel and brian d foy)

Hearing the word "in" is an indication that you probably should have used a hash, not a list or array, to store your data. Hashes are designed to answer this question quickly and efficiently. Arrays aren't.

That being said, there are several ways to approach this. In Perl 5.10 and later, you can use the smart match operator to check that an item is contained in an array or a hash:

use 5.010;

if( $item ~~ @array )
    {
    say "The array contains $item"
    }

if( $item ~~ %hash )
    {
    say "The hash contains $item"
    }

With earlier versions of Perl, you have to do a bit more work. If you are going to make this query many times over arbitrary string values, the fastest way is probably to invert the original array and maintain a hash whose keys are the first array's values:

@blues = qw/azure cerulean teal turquoise lapis-lazuli/;
%is_blue = ();
for (@blues) { $is_blue{$_} = 1 }

Now you can check whether $is_blue{$some_color}. It might have been a good idea to keep the blues all in a hash in the first place.

If the values are all small integers, you could use a simple indexed array. This kind of an array will take up less space:

@primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31);
@is_tiny_prime = ();
for (@primes) { $is_tiny_prime[$_] = 1 }
# or simply  @istiny_prime[@primes] = (1) x @primes;

Now you check whether $is_tiny_prime[$some_number].

If the values in question are integers instead of strings, you can save quite a lot of space by using bit strings instead:

@articles = ( 1..10, 150..2000, 2017 );
undef $read;
for (@articles) { vec($read,$_,1) = 1 }

Now check whether vec($read,$n,1) is true for some $n.

These methods guarantee fast individual tests but require a re-organization of the original list or array. They only pay off if you have to test multiple values against the same array.

If you are testing only once, the standard module List::Util exports the function first for this purpose. It works by stopping once it finds the element. It's written in C for speed, and its Perl equivalent looks like this subroutine:

sub first (&@) {
    my $code = shift;
    foreach (@_) {
        return $_ if &{$code}();
    }
    undef;
}

If speed is of little concern, the common idiom uses grep in scalar context (which returns the number of items that passed its condition) to traverse the entire list. This does have the benefit of telling you how many matches it found, though.

my $is_there = grep $_ eq $whatever, @array;

If you want to actually extract the matching elements, simply use grep in list context.

my @matches = grep $_ eq $whatever, @array;

Upvotes: 0

Related Questions