Reputation: 781
I am trying to write a random passphrase generator. I have a dictionary with a bunch of words and I would like to remove words whose root is already in the dictionary, so that a dictionary that looks like:
ablaze
able
abler
ablest
abloom
ably
would end up with only
ablaze
able
abloom
ably
because abler and ablest contain able, which already appears earlier in the list.
I would prefer to do this with grep so that I can learn more about how that works. I am capable of writing a program in C or Python that will do this.
Upvotes: 3
Views: 550
Reputation: 17945
This is based on the assumption that the input file is sorted. In that case, when looking up each word, all matches after the first one can be safely skipped (because they will correspond to "the same word with a different suffix").
#!/bin/bash
input=$1

while read -r word ; do
    # ignore short words
    if [ ${#word} -lt 4 ] ; then continue; fi
    # output this line
    echo "$word"
    # count how many lines (including this one) start with $word as prefix
    skip=$(grep -c -E -e "^${word}" "$input")
    # skip the following lines, which all extend $word
    for ((i=1; i<skip; i++)) ; do read -r word ; done
done < "$input"
Call as ./filter.sh input > output
This takes somewhat less than 2 minutes on all words of 4 or more letters found in my /usr/share/dict/american-english dictionary. The algorithm is O(n²), and is therefore unsuitable for large files.
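For reference, a preparation step along these lines should produce such a sorted input file (just a sketch: the dictionary path is the one mentioned above, and the length filter merely mirrors the check inside the script):

# keep plain lowercase words of 4 or more letters, sorted so that
# each root precedes its longer extensions
grep -E '^[a-z]{4,}$' /usr/share/dict/american-english | sort > input
./filter.sh input > output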
However, you can speed things up a lot if you avoid using grep at all. This version takes only 4 seconds to do the job (because it does not need to scan the whole file almost once per word). Since it performs a single pass over the input, its complexity is O(n):
#!/bin/bash
input=$1

while true ; do
    # use already-read word, or fail if cannot read new
    if [ -n "$next" ] ; then word=$next; unset next;
    elif ! read -r word ; then break; fi
    # ignore short words
    if [ ${#word} -lt 4 ] ; then continue; fi
    # output this word
    echo "$word"
    # skip words that start with $word as prefix
    while read -r next ; do
        # strip $word from the front of $next; if nothing was stripped,
        # $next does not extend $word, so stop skipping and reuse it
        unique=${next#"$word"}
        if [ ${#next} -eq ${#unique} ] ; then break; fi
    done
done < "$input"
Upvotes: 1
Reputation: 785286
Try this BASH script:
a=()
while read -r w; do
    # always keep the very first word
    [[ ${#a[@]} -eq 0 ]] && a+=("$w") && continue
    # keep $w only if it does not start with any word kept so far
    grep -qvf <(printf "^%s\n" "${a[@]}") <<< "$w" && a+=("$w")
done < file
printf "%s\n" "${a[@]}"

Output:
ablaze
able
abloom
ably
Upvotes: 0
Reputation: 189497
If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words
If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at the beginning of a line (this is now the prefix regex r), and also remember the prefix string in p.
If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.
awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
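For example, with the question's six sample words (already in sorted order) in a file named words, this second one-liner should print exactly the desired list:

$ awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
ablaze
able
abloom
ably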
Upvotes: 4
Reputation: 4681
If you just want to weed out some of the words, this gross command will work. Note that it'll throw out some legit words like best, but it's dead simple. It assumes you have a test.txt file with one word per line.
egrep -v "er$|est$" test.txt >> results.txt
egrep is the same as grep -E. -v means throw out matching lines. x|y matches if either x or y matches, and $ means end of line, so you'd be looking for words that end in er or est.
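On the six sample words from the question, for instance, this happens to produce exactly the desired list, though only because the unwanted words there all end in er or est:

$ egrep -v "er$|est$" test.txt
ablaze
able
abloom
ably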
Upvotes: 0
Reputation: 3908
Supposing you want to start with words that share the same first four (up to ten) letters, you could do something like this:
cp /usr/share/dict/words words
str="...."
for num in 4 5 6 7 8 9 10; do
    # for every word of exactly $num letters ...
    for word in $(grep "^$str$" words); do
        # ... remove all longer words that start with it
        grep -v "^$word." words > words.tmp
        mv words.tmp words
    done
    str=".$str"
done
You wouldn't want to start with 1 letter, unless 'a' is not in your dictionary, etc.
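As a quick illustration, assuming words holds just the six sample lines from the question, the num=4 pass picks up the four-letter words, and grep -v "^able." then drops abler and ablest while keeping able itself (the . requires at least one extra character):

$ grep "^....$" words
able
ably
$ grep -v "^able." words
ablaze
able
abloom
ably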
Upvotes: 0
Reputation: 2202
It seems like you want to group adverbs together. Some adverbs, including those that can also be adjectives, use er and est to form comparisons.
This procedure is known as stemming in natural language processing, and can be achieved using a stemmer or lemmatizer. There are popular implementations in Python's NLTK module, but the problem is not completely solved. The best out-of-the-box stemmer is the Snowball stemmer, but it does not stem adverbs to their root.
import nltk

# sample word list; the leading newline leaves an empty first entry,
# which is why '' shows up in the output below
initial = '''
ablaze
able
abler
ablest
abloom
ably
fast
faster
fastest
'''.splitlines()

snowball = nltk.stem.snowball.SnowballStemmer("english")
stemmed = [snowball.stem(word) for word in initial]
print set(stemmed)
output...
set(['', u'abli', u'faster', u'abl', u'fast', u'abler', u'abloom', u'ablest', u'fastest', u'ablaz'])
The other option is to use a regex stemmer, but this has its own difficulties, I'm afraid.
# strip er/est endings, but only from words of at least 4 characters
patterns = "er$|est$"
regex_stemmer = nltk.stem.RegexpStemmer(patterns, 4)
stemmed = [regex_stemmer.stem(word) for word in initial]
print set(stemmed)
output...
set(['', 'abloom', 'able', 'abl', 'fast', 'ably', 'ablaze'])
Upvotes: 0