kinkybudget

Reputation: 27

Using grep and regex to extract words from a file that contain only one kind of vowel

I have a large dictionary file that contains one word per line.

I want to extract all lines that contain only one kind of vowel, so "see", "best", "levee" and "whenever" would be extracted, but "like", "house" and "and" wouldn't. I don't mind making several passes over the file, changing the vowel I'm looking for each time.

This command: grep -io '\b[eqwrtzpsdfghjklyxcvbnm]*\b' dictionary.txt

correctly excludes words containing any vowel other than E, but it also returns words with no vowel at all, like BBC or BMW. How can I make the vowel itself a requirement?

Upvotes: 1

Views: 183

Answers (3)

tripleee

Reputation: 189618

Here is an Awk attempt which collects all the hits in a single pass over the input file, then prints each bucket.

awk 'BEGIN { split("a:e:i:o:u", vowel, ":")
    c = "[b-df-hj-np-tv-z]"
    for (v in vowel)
      regex = (regex ? regex "|" : "") "^" c "*" vowel[v] c "*(" vowel[v] c "*)*$" }
    $0 ~ regex { for (v in vowel) if ($0 ~ vowel[v]) {
        hit[v] = ( hit[v] ? hit[v] ORS : "") $0
        next } }
    END { for (v in vowel) {
        printf "=== %s ===\n", vowel[v]
        print hit[v] } }' /usr/share/dict/words

You'll notice that it prints words with syllabic y like jolly and cycle. A more complex regex should fix that, though the really thorny cases (like rhyme) need a more sophisticated model of English orthography.

The regex is clumsy because Awk does not support backreferences; an earlier version of this answer contained a simpler regex which would work with grep -E or similar, but it collected all matches in the same bucket.

Demo: https://ideone.com/wNrvPu
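
For comparison, a single-regex version with a backreference might look like the sketch below. It is a guess at the general shape, not the earlier regex mentioned above. It assumes a grep that accepts \1 with -E (GNU grep does, as an extension beyond POSIX) and lowercase input; the function name one_vowel_kind is made up for illustration.

```shell
# Hypothetical helper: print lines whose vowels are all the same letter.
# Assumes a grep that accepts the backreference \1 with -E (e.g. GNU grep)
# and lowercase input.
one_vowel_kind() {
    # [^aeiou]*       leading non-vowels
    # ([aeiou])       the first vowel, captured as \1
    # ([^aeiou]|\1)*  after that, only non-vowels or that same vowel
    grep -E '^[^aeiou]*([aeiou])([^aeiou]|\1)*$' "$@"
}

# Sample words from the question, lowercased.
printf '%s\n' see best levee whenever like house and bbc bmw | one_vowel_kind
```

On the sample words this keeps see, best, levee, whenever and also "and" (whose only vowel is a), all in one list. That is the trade-off against the Awk script: one short pattern, but no per-vowel buckets.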

Upvotes: 0

melpomene

Reputation: 85827

How about

grep -i '^[^aiou]*e[^aiou]*$'

?
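
Since the question allows one pass per vowel, the same pattern could be generated for each vowel in a loop. This is only a sketch: the helper name only_vowel and the inline sample words are assumptions for illustration, and in practice the words would come from the dictionary file.

```shell
# Hypothetical helper: print words whose only vowel is $1 (one of a e i o u).
only_vowel() {
    keep=$1
    shift
    # Remove the wanted vowel from "aeiou" to get the vowels to exclude.
    others=$(printf 'aeiou' | tr -d "$keep")
    # For keep=e this expands to the pattern above: ^[^aiou]*e[^aiou]*$
    grep -i "^[^${others}]*${keep}[^${others}]*\$" "$@"
}

# Sample words from the question; a real run would read the dictionary instead.
words=$(printf '%s\n' see best levee whenever like house and BBC BMW)

for vowel in a e i o u; do
    echo "=== $vowel ==="
    # grep exits with status 1 when nothing matches, which is fine here.
    printf '%s\n' "$words" | only_vowel "$vowel" || true
done
```

Extra arguments are passed through to grep, so only_vowel e dictionary.txt reads the file directly.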

Upvotes: 1

Toto

Reputation: 91488

Using grep's -P (Perl-compatible regex) option:

^(?=.*e)[^aiou]+$

Explanation:

^               # beginning of line
    (?=.*e)     # positive lookahead, make sure we have at least one "e"
    [^aiou]+    # 1 or more characters, none of which is another vowel (a, i, o, u)
$               # end of line

cat file.txt
see
best
levee
whenever
like
house
and
BBC
BMW

grep -P '^(?=.*e)[^aiou]+$' file.txt
see
best
levee
whenever
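
Where -P is not available, the two conditions that the lookahead combines can be sketched as two chained greps. This is an alternative formulation under that assumption, not part of the pattern above; the function name has_only_e is made up for illustration.

```shell
# Hypothetical helper: same filter as the lookahead pattern, without -P.
has_only_e() {
    # First grep: keep lines that contain an "e".
    # Second grep: drop (-v) any line that contains another vowel.
    grep 'e' "$@" | grep -v '[aiou]'
}

# Same sample words as in file.txt above.
printf '%s\n' see best levee whenever like house and BBC BMW | has_only_e
```

Like the -P pattern, this is case-sensitive as written; adding -i to both greps would be one way to handle capitalized words.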

Upvotes: 0
