Reputation: 2502

Linux: remove strings from list if they have substrings elsewhere in list

I would like to take a list of strings and keep only those that have no substrings elsewhere in the list. To illustrate, I have this list:

apple
applesauce
kiwi
mango
mangoes
mangosteen
oranges
pineapples

I want to reduce it down to a list of strings which do not have any substrings elsewhere in the list. So, the resulting list would be:

apple
kiwi
mango
oranges

Note that applesauce and pineapples were removed because apple is elsewhere in the list and is a substring of both those words.

I found a similar question here but it seems to be aimed specifically at prefixes, e.g. ablaze, able, abler, ablest. Based on that method, I tried the following with a pre-sorted copy of my list and it just printed the entire list, without even removing applesauce which I thought it would:

awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' fruitsorted.txt

Even if it had worked as I expected, it would still miss pineapple in my list.

Note that in an extreme case, if the list contained all the letters of the alphabet (or ASCII char set I guess) each on a separate line, then regardless of what else was in the list, the ouput would just be the alphabet/ char set.

Also, my starting list is unsorted. I don't really care if the resulting list is sorted although that's obviously trivial with sort.

Ideally I would like a somewhat compact shell command/ sequence of stuff like grep/ sort/ awk, as opposed to a longer form Perl/ Python/ whatever script which I already know how to implement.

Thanks.

Update

As pointed out by Ed Morton below, even sorting the list may mess up some basic approaches, e.g. in the following example, approaches that are assuming a sorted list will probably fail to remove berryplum since its substring plum comes after it. The second approach shown by 123 handles this case.

apple
applesauce
berryplum
kiwi
mango
mangoes
mangosteen
oranges
pineapples
plum

Upvotes: 2

Answers (4)

dawg

Reputation: 103844

Given:

$ cat f1
apple
applesauce
berryplum
kiwi
mango
mangoes
mangosteen
oranges
pineapples
plum

You can use a little more looping to avoid reading the file twice or being concerned about order:

$ awk '{words[$1]}
     END{
        for (e in words)
            for (f in words)
                if (f!=e && index(e,f)) 
                    not[e]   
        for (e in words)
           if (!(e in not))
               print e}' f1
mango
plum
apple
oranges
kiwi

Upvotes: 0

123

Reputation: 11216

If the list is sorted it's pretty simple

awk '{for(i in a)if(index($0,i))next;a[$0]}1' file

apple
kiwi
mango
oranges

Basically just loops over an array for each line, and checks if elements exist in the line. Adds to array if this is not the case.

For unsorted list this should work

awk '{for(i in a){if(index(i,$0)&&$0!=i)delete a[i];if(index($0,i))next}a[$0];next}
     END{for(i in a)print i}' file

Tested on Wordlist for performance.

real    0m29.932s
user    0m29.918s
sys     0m0.008s

Upvotes: 2

Ed Morton

Reputation: 203532

$ awk '
   NR==FNR { fruits[$0]; next }
   {
       for (fruit in fruits) {
           if ((fruit != $0) && index($0,fruit)) {
               next
           }
        }
        final[$0]
    }
    END {
        for (fruit in final) {
            print fruit
        }
    }
' file file
mango
apple
oranges
kiwi

You can cram it all onto one line if you find that valuable:

awk 'NR==FNR{fruits[$0];next} {for (fruit in fruits) if ((fruit != $0) && index($0,fruit)) next; final[$0]} END{for (fruit in final) print fruit}' file file

Upvotes: 1

sjsam

Reputation: 21965

For an unsorted list this may help :

awk 'NR==FNR{f1[NR]=$0;f2[$0]}
    END{
    for(i=0;i<=NR;i++){
      for(j in f2){
        if(match(f1[i],j)>=1){
            if(length(j)<length(f1[i])){
            f1[i]="nullfruit"
            }
        }
      }
    }
    for(i=0;i<=NR;i++){
         if(f1[i]!="nullfruit"){
            print f1[i];
            }
    }
    }' filename

apple
kiwi
mango
oranges

^{Note: pretty sure more subtle solutions exists.}

Upvotes: 0

Linux: remove strings from list if they have substrings elsewhere in list

Answers (4)

Related Questions