Reputation: 2502
I would like to take a list of strings and keep only those that have no substrings elsewhere in the list. To illustrate, I have this list:
apple
applesauce
kiwi
mango
mangoes
mangosteen
oranges
pineapples
I want to reduce it down to a list of strings which do not have any substrings elsewhere in the list. So, the resulting list would be:
apple
kiwi
mango
oranges
Note that applesauce
and pineapples
were removed because apple
is elsewhere in the list and is a substring of both those words.
I found a similar question here but it seems to be aimed specifically at prefixes, e.g. ablaze, able, abler, ablest
. Based on that method, I tried the following with a pre-sorted copy of my list and it just printed the entire list, without even removing applesauce
which I thought it would:
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' fruitsorted.txt
Even if it had worked as I expected, it would still miss pineapple
in my list.
Note that in an extreme case, if the list contained all the letters of the alphabet (or ASCII char set I guess) each on a separate line, then regardless of what else was in the list, the ouput would just be the alphabet/ char set.
Also, my starting list is unsorted. I don't really care if the resulting list is sorted although that's obviously trivial with sort
.
Ideally I would like a somewhat compact shell command/ sequence of stuff like grep/ sort/ awk, as opposed to a longer form Perl/ Python/ whatever script which I already know how to implement.
Thanks.
Update
As pointed out by Ed Morton below, even sorting the list may mess up some basic approaches, e.g. in the following example, approaches that are assuming a sorted list will probably fail to remove berryplum
since its substring plum
comes after it. The second approach shown by 123 handles this case.
apple
applesauce
berryplum
kiwi
mango
mangoes
mangosteen
oranges
pineapples
plum
Upvotes: 2
Views: 1236
Reputation: 103844
Given:
$ cat f1
apple
applesauce
berryplum
kiwi
mango
mangoes
mangosteen
oranges
pineapples
plum
You can use a little more looping to avoid reading the file twice or being concerned about order:
$ awk '{words[$1]}
END{
for (e in words)
for (f in words)
if (f!=e && index(e,f))
not[e]
for (e in words)
if (!(e in not))
print e}' f1
mango
plum
apple
oranges
kiwi
Upvotes: 0
Reputation: 11216
If the list is sorted it's pretty simple
awk '{for(i in a)if(index($0,i))next;a[$0]}1' file
apple
kiwi
mango
oranges
Basically just loops over an array for each line, and checks if elements exist in the line. Adds to array if this is not the case.
For unsorted list this should work
awk '{for(i in a){if(index(i,$0)&&$0!=i)delete a[i];if(index($0,i))next}a[$0];next}
END{for(i in a)print i}' file
Tested on Wordlist for performance.
real 0m29.932s
user 0m29.918s
sys 0m0.008s
Upvotes: 2
Reputation: 203532
$ awk '
NR==FNR { fruits[$0]; next }
{
for (fruit in fruits) {
if ((fruit != $0) && index($0,fruit)) {
next
}
}
final[$0]
}
END {
for (fruit in final) {
print fruit
}
}
' file file
mango
apple
oranges
kiwi
You can cram it all onto one line if you find that valuable:
awk 'NR==FNR{fruits[$0];next} {for (fruit in fruits) if ((fruit != $0) && index($0,fruit)) next; final[$0]} END{for (fruit in final) print fruit}' file file
Upvotes: 1
Reputation: 21965
For an unsorted list this may help :
awk 'NR==FNR{f1[NR]=$0;f2[$0]}
END{
for(i=0;i<=NR;i++){
for(j in f2){
if(match(f1[i],j)>=1){
if(length(j)<length(f1[i])){
f1[i]="nullfruit"
}
}
}
}
for(i=0;i<=NR;i++){
if(f1[i]!="nullfruit"){
print f1[i];
}
}
}' filename
apple
kiwi
mango
oranges
Note: pretty sure more subtle solutions exists.
Upvotes: 0