msallge
msallge

Reputation: 109

Remove elements from an array which have a substring that is itself an element of the array

In Perl, I'd like to remove all elements from an array where another element of the same array is a non-empty substring of said element.

Say I have the array

@itemlist = ("abcde", "ab", "khi", "jklm");

In this instance I would like to have the element "abcde" removed, because "ab" is a substring of "abcde".

I could make a copy of the array (maybe as a hash?), iterate over it, try to index with every element of the original array and remove it, but there has to be a more elegant way, no?

Thanks for your help!

Edited for clarity a bit.

Upvotes: 2

Views: 634

Answers (6)

user3507704
user3507704

Reputation: 75

I had the inverse problem: removing from the list strings which are substrings of other strings. Here is my not-too-elegant solution.

sub remove_substrings_from_list {
    my @list = @_;
    my @vals_without_superstrings;

    my %hash_of_others;
    for ( 0 .. $#list ) {
        my $a = shift @list;
        $hash_of_others{$a} = [ @list ];
        push @list, $a;
    }
    foreach my $k ( keys %hash_of_others ) {
        push @vals_without_superstrings, $k unless grep { index( $_, $k ) != -1 } @{ $hash_of_others{$k} };
    }
    return @vals_without_superstrings;
}

Upvotes: 0

aks
aks

Reputation: 1

The following will remove the substring from the array.

#!/usr/bin/perl
use strict;
use warnings;

my @ar=("asl","pwe","jsl","nxu","sl","baks","ak");
foreach my $i (@ar){
  my $p = grep /$i/, @ar;
  if ( $p == 1 ){
    print "$i" , "\n";
  }
} 

Upvotes: 0

TLP
TLP

Reputation: 67930

You can use a hash to count substrings of all the words. Any word in the list that has a higher count than one is then a substring of another word. The minimum length of the substrings is two in this example:

use strict;
use warnings;
use feature 'say';

my @list = qw(abcde ab foo foobar de oba cd xs);

my %count;

for my $word (@list) {
    my $len = length $word;
    $count{$word}++;
    for my $start (0 .. $len - 2) {
        for my $long (2 .. $len - 2) {
            my $sub = substr($word, $start, $long);
            $count{$sub}++;
        }
    }
}
say for grep $count{$_} == 1, @list;

Output:

abcde
foobar
xs

Upvotes: 0

Bribles
Bribles

Reputation: 948

wdebeaum's answer is the solution to use, not the one below, but I learned something by doing it and perhaps someone else will too. After I had written mine I decided to test it on lists of several thousand elements.

b.pl:

#!/usr/bin/perl

use strict;
use warnings;

my @itemlist = <>;
for(@itemlist) { chomp; }
my $regex;

if(defined $ENV{wdebeaum}) {
    # wdebeaum's solution
    my $alternation = join('|', map(quotemeta, @itemlist));
    $regex = qr/(?:$alternation).|.(?:$alternation)/;
} else {
    # my solution
    $regex = join "|", map {qq{(?:\Q$_\E.)|(?:.\Q$_\E)}} @itemlist;
}

my @result = grep !/$regex/, @itemlist;
print scalar @itemlist, "\t", scalar @result, "\n";

I generated a list of 5000 random words.

sort -R /usr/share/dict/american-english|head -5000 > some-words

For small lists both solutions seem fine.

$ time head -200 some-words | wdebeaum=1 ./b.pl
200 198

real    0m0.012s
user    0m0.004s
sys     0m0.004s

$ time head -200 some-words | ./b.pl
200 198

real    0m0.068s
user    0m0.060s
sys     0m0.004s

But for larger lists, wdebeaum's is clearly better.

$ time cat some-words | wdebeaum=1 ./b.pl 
5000    1947

real    0m0.068s
user    0m0.064s
sys     0m0.000s

$ time cat some-words | ./b.pl 
5000    1947

real    0m8.305s
user    0m8.277s
sys     0m0.012s

I think the reason for the difference, is that even though both regular expressions have the same number of possible paths, my regex has more paths that have to be tried, since it has the same number of .s as paths, while wdebebaum's has only two.

Upvotes: 0

user1919238
user1919238

Reputation:

Well, I wouldn't call this elegant, but here goes:

#!usr/bin/perl
use strict;
use warnings;

my @itemlist = ("abcde", "ab", "khi", "jklm");

@itemlist = grep { 
    @itemlist ~~ sub {$_ !~ /(?:.\Q$_[0]\E|\Q$_[0]\E.)/} 
} @itemlist;

print "@itemlist";

It relies on a rather obscure behavior of smart match: if the left argument is an array and the right argument a sub, it calls the sub for each element, and the final result is true only if the sub returns true for each element.

Explanation: for each element of the array, it checks that no other element is a substring of that element (requiring at least one additional character so that elements won't match themselves).

Note: wdebeaum's answer is probably the one I would prefer in the real world. Still, it is kind of interesting the strange things one can do with smart match.

Upvotes: 1

wdebeaum
wdebeaum

Reputation: 4231

You could construct a regex from all the items and throw out anything that matches:

$alternation = join('|', map(quotemeta, @itemlist));
@itemlist = grep !/($alternation).|.($alternation)/, @itemlist;

The ().|.() thing just ensures that an item doesn't match itself.

Upvotes: 3

Related Questions