Reputation: 109
In Perl, I'd like to remove all elements from an array where another element of the same array is a non-empty substring of said element.
Say I have the array
@itemlist = ("abcde", "ab", "khi", "jklm");
In this instance I would like to have the element "abcde"
removed, because "ab"
is a substring of "abcde"
.
I could make a copy of the array (maybe as a hash?), iterate over it, try to index with every element of the original array and remove it, but there has to be a more elegant way, no?
Thanks for your help!
Edited for clarity a bit.
Upvotes: 2
Views: 634
Reputation: 75
I had the inverse problem: removing from the list strings which are substrings of other strings. Here is my not-too-elegant solution.
sub remove_substrings_from_list {
my @list = @_;
my @vals_without_superstrings;
my %hash_of_others;
for ( 0 .. $#list ) {
my $a = shift @list;
$hash_of_others{$a} = [ @list ];
push @list, $a;
}
foreach my $k ( keys %hash_of_others ) {
push @vals_without_superstrings, $k unless grep { index( $_, $k ) != -1 } @{ $hash_of_others{$k} };
}
return @vals_without_superstrings;
}
Upvotes: 0
Reputation: 1
The following will remove the substring from the array.
#!/usr/bin/perl
use strict;
use warnings;
my @ar=("asl","pwe","jsl","nxu","sl","baks","ak");
foreach my $i (@ar){
my $p = grep /$i/, @ar;
if ( $p == 1 ){
print "$i" , "\n";
}
}
Upvotes: 0
Reputation: 67930
You can use a hash to count substrings of all the words. Any word in the list that has a higher count than one is then a substring of another word. The minimum length of the substrings is two in this example:
use strict;
use warnings;
use feature 'say';
my @list = qw(abcde ab foo foobar de oba cd xs);
my %count;
for my $word (@list) {
my $len = length $word;
$count{$word}++;
for my $start (0 .. $len - 2) {
for my $long (2 .. $len - 2) {
my $sub = substr($word, $start, $long);
$count{$sub}++;
}
}
}
say for grep $count{$_} == 1, @list;
Output:
abcde
foobar
xs
Upvotes: 0
Reputation: 948
wdebeaum's answer is the solution to use, not the one below, but I learned something by doing it and perhaps someone else will too. After I had written mine I decided to test it on lists of several thousand elements.
b.pl:
#!/usr/bin/perl
use strict;
use warnings;
my @itemlist = <>;
for(@itemlist) { chomp; }
my $regex;
if(defined $ENV{wdebeaum}) {
# wdebeaum's solution
my $alternation = join('|', map(quotemeta, @itemlist));
$regex = qr/(?:$alternation).|.(?:$alternation)/;
} else {
# my solution
$regex = join "|", map {qq{(?:\Q$_\E.)|(?:.\Q$_\E)}} @itemlist;
}
my @result = grep !/$regex/, @itemlist;
print scalar @itemlist, "\t", scalar @result, "\n";
I generated a list of 5000 random words.
sort -R /usr/share/dict/american-english|head -5000 > some-words
For small lists both solutions seem fine.
$ time head -200 some-words | wdebeaum=1 ./b.pl
200 198
real 0m0.012s
user 0m0.004s
sys 0m0.004s
$ time head -200 some-words | ./b.pl
200 198
real 0m0.068s
user 0m0.060s
sys 0m0.004s
But for larger lists, wdebeaum's is clearly better.
$ time cat some-words | wdebeaum=1 ./b.pl
5000 1947
real 0m0.068s
user 0m0.064s
sys 0m0.000s
$ time cat some-words | ./b.pl
5000 1947
real 0m8.305s
user 0m8.277s
sys 0m0.012s
I think the reason for the difference, is that even though both regular expressions have the same number of possible paths, my regex has more paths that have to be tried, since it has the same number of .
s as paths, while wdebebaum's has only two.
Upvotes: 0
Reputation:
Well, I wouldn't call this elegant, but here goes:
#!usr/bin/perl
use strict;
use warnings;
my @itemlist = ("abcde", "ab", "khi", "jklm");
@itemlist = grep {
@itemlist ~~ sub {$_ !~ /(?:.\Q$_[0]\E|\Q$_[0]\E.)/}
} @itemlist;
print "@itemlist";
It relies on a rather obscure behavior of smart match: if the left argument is an array and the right argument a sub, it calls the sub for each element, and the final result is true only if the sub returns true for each element.
Explanation: for each element of the array, it checks that no other element is a substring of that element (requiring at least one additional character so that elements won't match themselves).
Note: wdebeaum's answer is probably the one I would prefer in the real world. Still, it is kind of interesting the strange things one can do with smart match.
Upvotes: 1
Reputation: 4231
You could construct a regex from all the items and throw out anything that matches:
$alternation = join('|', map(quotemeta, @itemlist));
@itemlist = grep !/($alternation).|.($alternation)/, @itemlist;
The ().|.()
thing just ensures that an item doesn't match itself.
Upvotes: 3