William Everett

Reputation: 781

Use grep to remove words from dictionary whose roots are already present

I am trying to write a random passphrase generator. I have a dictionary with a bunch of words and I would like to remove words whose root is already in the dictionary, so that a dictionary that looks like:

ablaze
able
abler
ablest
abloom
ably

would end up with only

ablaze
able
abloom
ably

because abler and ablest contain able which was previously used.

I would prefer to do this with grep so that I can learn more about how that works. I am capable of writing a program in C or Python that will do this.

Upvotes: 3

Views: 550

Answers (6)

tucuxi

Reputation: 17945

This is based on the assumption that the input file is sorted. In that case, when looking up each word, all matches after the first one can be safely skipped (because they will correspond to "the same word with a different suffix").

#!/bin/bash
input=$1
while read -r word ; do
    # ignore short words
    if [ ${#word} -lt 4 ] ; then continue; fi
    # output this line
    echo "$word"
    # skip next lines that start with $word as prefix
    skip=$(grep -c -E -e "^${word}" "$input")
    for ((i=1; i<skip; i++)) ; do read -r word ; done
done < "$input"

Call as ./filter.sh input > output

This takes somewhat less than 2 minutes on all words of 4 or more letters found in my /usr/share/dict/american-english dictionary. The algorithm is O(n²), and therefore unsuitable for large files.

However, you can speed things up a lot if you avoid using grep at all. This version takes only 4 seconds to do the job, because it avoids re-scanning the whole file for almost every word. Since it performs a single pass over the input, its complexity is O(n):

#!/bin/bash
input=$1
while true ; do
    # use already-read word, or fail if cannot read new
    if [ -n "$next" ] ; then word=$next; unset next;
    elif ! read -r word ; then break; fi
    # ignore short words
    if [ ${#word} -lt 4 ] ; then continue; fi
    # output this word
    echo "$word"
    # skip words that start with $word as prefix
    while read -r next ; do
        unique=${next#"$word"}
        if [ ${#next} -eq ${#unique} ] ; then break; fi
    done
done < "$input"
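As a quick sanity check, assuming the single-pass script above is saved as filter.sh, running it on the question's sorted sample list reproduces exactly the desired output:

```shell
# sample.txt holds the question's sorted example list
printf 'ablaze\nable\nabler\nablest\nabloom\nably\n' > sample.txt
bash filter.sh sample.txt
# prints: ablaze, able, abloom, ably
```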

Upvotes: 1

anubhava

Reputation: 785286

Try this BASH script:

a=()
while read -r w; do
   [[ ${#a[@]} -eq 0 ]] && a+=("$w") && continue
   grep -qvf <(printf "^%s\n" "${a[@]}") <<< "$w" && a+=("$w")
done < file

printf "%s\n" "${a[@]}"

Output:

ablaze
able
abloom
ably

Upvotes: 0

tripleee

Reputation: 189497

If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.

awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words

If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at beginning of line (this is now the prefix regex r) and also remember the prefix string in p.

If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.

awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
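For instance, piping the question's sorted sample list through the simplified script reproduces the desired output:

```shell
printf 'ablaze\nable\nabler\nablest\nabloom\nably\n' |
awk 'NR>1 && $1~r { next } { print; r="^" $1 }'
# prints: ablaze, able, abloom, ably
```

Note that r is used as a dynamic regex, so words containing regex metacharacters would need escaping; plain dictionary words are fine.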

Upvotes: 4

FuriousGeorge

Reputation: 4681

If you just want to weed out some of the words, this gross command will work. Note that it'll throw out some legit words like best, but it's dead simple. It assumes you have a test.txt file with one word per line.

egrep -v "er$|est$" test.txt >> results.txt

egrep is the same as grep -E. -v means throw out matching lines, x|y matches if either x or y matches, and $ anchors the match to the end of the line, so you'd be looking for words that end in er or est.
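On the question's sample list this happens to give exactly the desired result, since only abler and ablest end in er or est (grep -E is the modern spelling of egrep):

```shell
printf 'ablaze\nable\nabler\nablest\nabloom\nably\n' | grep -E -v 'er$|est$'
# prints: ablaze, able, abloom, ably
```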

Upvotes: 0

Markku K.

Reputation: 3908

Supposing you want to start with words that share the same first four (up to ten) letters, you could do something like this:

cp /usr/share/dict/words words
str="...."
for num in 4 5 6 7 8 9 10; do
    for word in $(grep "^$str$" words); do
        grep -v "^$word." words > words.tmp
        mv words.tmp words
    done
    str=".$str"
done

You wouldn't want to start with 1 letter, unless 'a' is not in your dictionary, etc.

Upvotes: 0

Tim

Reputation: 2202

It seems like you want to group adverbs together. Some adverbs, including those that can also be adjectives, use er and est to form comparisons:

  • able, abler, ablest
  • fast, faster, fastest
  • soon, sooner, soonest
  • easy, easier, easiest

This procedure is known as stemming in natural language processing, and can be achieved using a stemmer or lemmatizer. There are popular implementations in Python's NLTK module, but the problem is not completely solved. The best out-of-the-box stemmer is the Snowball stemmer, but it does not stem adverbs to their root.

import nltk

initial = '''
ablaze
able
abler
ablest
abloom
ably
fast
faster
fastest
'''.splitlines()

snowball = nltk.stem.snowball.SnowballStemmer("english")
stemmed = [snowball.stem(word) for word in initial]

print(set(stemmed))

output...

set(['', u'abli', u'faster', u'abl', u'fast', u'abler', u'abloom', u'ablest', u'fastest', u'ablaz'])

The other option is to use a regex stemmer, but this has its own difficulties, I'm afraid.

patterns = "er$|est$"
regex_stemmer = nltk.stem.RegexpStemmer(patterns, 4)
stemmed = [regex_stemmer.stem(word) for word in initial]

print(set(stemmed))

output...

set(['', 'abloom', 'able', 'abl', 'fast', 'ably', 'ablaze'])

Upvotes: 0
