Reputation: 447

Counting the total number of occurrences of a set of words using grep in bash

I have a set of words: happy, enjoy, dead, cheerful.

I want to count the total number of appearances of these words in a text file, q.txt.

Right now I am using grep to count each word individually and then adding the counts up, but that does not scale well as I add more words.

Upvotes: 1

Answers (4)

glenn jackman

Reputation: 247210

words="happy enjoy dead cheerful"
regex=$(set -- $words; IFS='|'; echo "$*")
grep -o -E -w "$regex" q.txt | sort | uniq -c

With the total:

t=0
while read -r count word; do
    (( t += count ))
    printf "%8d %s\n" "$count" "$word"
done < <(grep -o -E -w "$regex" q.txt | sort | uniq -c)
echo "total is $t"
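
If only the grand total is needed, the per-word breakdown can be skipped. The following is a minimal sketch, not a tested benchmark: it assumes a grep that supports -o (GNU and BSD grep do), and the sample file contents are made up for illustration.

```shell
# Build a small sample file (hypothetical contents, for illustration only)
sample=$(mktemp)
printf '%s\n' 'happy and cheerful, not dead' 'enjoy it; be happy' 'unhappy deadline' > "$sample"

words="happy enjoy dead cheerful"
regex=$(set -- $words; IFS='|'; echo "$*")

# -o prints each match on its own line, so counting lines counts occurrences;
# -w keeps "unhappy" and "deadline" from being counted
total=$(grep -o -E -w "$regex" "$sample" | wc -l)
total=$((total))   # strip the padding some wc implementations add
echo "total is $total"
rm -f "$sample"
```

With the sample above this prints "total is 5": three matches on the first line, two on the second, none on the third.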

Upvotes: 3

glenn jackman

Reputation: 247210

Timing some of the answers.

I concatenated /usr/share/dict/words a bunch of times to create a large file

$ ll words
-rw-rw-r-- 1 jackman jackman 653M Sep 19 11:10 words

grep|sort|uniq

$ time sh -c 'grep -oEw "happy|enjoy|dead|cheerful" words | sort | uniq -c'
    729 cheerful
   1458 dead
    729 enjoy
    729 happy

real    0m2.232s
user    0m2.148s
sys     0m0.084s

awk

$ time awk -v RS='[,."?!]*[[:space:]]+' '/happy|enjoy|dead|cheerful/{ a[$0]++ } END{ for(i in a) print i,a[i] }' words
deaden 729
deadliness 729
deader 729
deadline 729
deadbeats 729
deadens 729
cheerfuller 729
deadened 729
deadliest 729
enjoyable 729
deadlock's 729
dead's 729
deadbolts 729
cheerfulness 729
deadlier 729
deadbolt's 729
deadbeat's 729
happy 729
deadwood 729
cheerfully 729
enjoyment's 729
deadpan's 729
deadbeat 729
deadbolt 729
deadliness's 729
cheerfullest 729
enjoyments 729
deadlock 729
enjoyment 729
deadpan 729
deadpanned 729
dead 729
enjoy 729
deadest 729
deadpanning 729
deadly 729
enjoys 729
slaphappy 729
unhappy 729
deadlocks 729
deadlines 729
deadpans 729
deadening 729
enjoyed 729
deadlocked 729
deadwood's 729
cheerfulness's 729
deadline's 729
enjoying 729
deadlocking 729
cheerful 729

real    0m46.817s
user    0m46.720s
sys     0m0.228s

awk, but simplified: since we know the file has one word per line, we can avoid regular expression matching.

$ time awk -v w="happy enjoy dead cheerful" '
    BEGIN {n=split(w,a); for (i=1; i<=n; i++) words[a[i]]=1} 
    $1 in words {count[$1]++} 
    END {for (word in count) print count[word], word}
' words
729 cheerful
729 enjoy
729 happy
729 dead

real    0m13.781s
user    0m13.652s
sys     0m0.164s

Would it be faster to do a straight string equality comparison, since the list of "needle" words is short?

$ time awk '
    $1 == "happy" || $1 == "enjoy" || $1 == "dead" || $1 == "cheerful" {count[$1]++} 
    END {for (word in count) print count[word], word}
' words
729 cheerful
729 enjoy
729 happy
729 dead

real    0m32.738s
user    0m32.668s
sys     0m0.156s

No. It seems the in operator is quick.

Surprisingly (to me), grepping the file multiple times is still quite fast:

$ time sh -c 'for i in happy enjoy dead cheerful; do echo "$(grep -cFx "$i" words) $i"; done'
729 happy
729 enjoy
729 dead
729 cheerful

real    0m2.480s
user    0m2.132s
sys     0m0.348s

Anyway, the grep|sort|uniq pipeline is speediest so far.

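A new winner: grepping the file multiple times, this time with -w instead of -Fx. Note that -w also matches dead inside dead's, which is why dead shows 1458 here, as it did in the grep|sort|uniq run: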

$ time sh -c 'for i in happy enjoy dead cheerful; do echo "$(grep -cw "$i" words) $i"; done'
729 happy
729 enjoy
1458 dead
729 cheerful

real    0m1.708s
user    0m1.348s
sys     0m0.356s

Upvotes: 0

RomanPerekhrest

Reputation: 92904

With a single awk process.
I also believe this will run much faster on "big" files than grep + sort + uniq:

Sample q.txt:

I thought that the aim of life is to be happy. Till you not dead -  you enjoy of life and feeling cheerful.
Just enjoy and then dead ...
Everyone want to be happy. Am I happy?
Just remember that we'll all die. Live like dead man, striving to recreate hisself ... and not just dreaming about cheerful, 
enjoy, happy ...

awk -v RS='[,."?!]*[[:space:]]+' '/happy|enjoy|dead|cheerful/{ a[$0]++ }
           END{ for(i in a) print i,a[i] }' q.txt

The output:

cheerful 2
enjoy 3
happy 4
dead 3
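
Note that the pattern above also matches substrings such as "unhappy" or "deadline". If only whole words should count, one possible variation is to split each line on non-letters and look each token up in an array. This is only a sketch: the variable names (want, tok, count) and the sample text are assumptions for illustration, and matching stays case-sensitive.

```shell
# Hypothetical sample text, for illustration only
sample=$(mktemp)
printf '%s\n' 'Be happy. Unhappy? No: happy, cheerful!' 'dead deadline enjoy' > "$sample"

result=$(awk -v w="happy enjoy dead cheerful" '
    BEGIN { n = split(w, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
    {
        # treat every run of non-letters as a separator, then test each token exactly
        m = split($0, tok, /[^A-Za-z]+/)
        for (j = 1; j <= m; j++)
            if (tok[j] in want) { count[tok[j]]++; total++ }
    }
    END {
        # print in the order the words were given, then the grand total
        for (k = 1; k <= n; k++) print a[k], count[a[k]] + 0
        print "total", total + 0
    }
' "$sample")
echo "$result"
rm -f "$sample"
```

On the sample text this reports happy 2, enjoy 1, dead 1, cheerful 1 and a total of 5; "Unhappy" and "deadline" are not counted.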

Upvotes: 0

David Jenkins

Reputation: 471

What do you mean by the total number of appearances? Do you want the total for each word separately, or the total of all the words combined?

I would do something like this:

Put the words you want to count in a separate file, words.txt, one per line. Then, if you want to output each individual word with its count:

# read the word list line by line; quoting avoids word splitting and globbing
while read -r i; do
    printf '%s - ' "$i"
    grep -c -- "$i" q.txt
done < words.txt

If you just want the sum of all the numbers, maybe something like this:

# sum the per-word counts with awk
while read -r i; do
    grep -c -- "$i" q.txt
done < words.txt | awk '{SUM += $1} END {print SUM}'
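
One caveat: grep -c counts matching lines, not occurrences, and a bare pattern also matches inside longer words. A possible alternative, sketched here on the assumption that your grep supports -o, is to pass the whole word list with -f and count the individual matches. The file contents below are made up for illustration.

```shell
# Hypothetical word list and text file, for illustration only
dir=$(mktemp -d)
printf '%s\n' happy enjoy dead cheerful > "$dir/words.txt"
printf '%s\n' 'happy happy enjoy' 'unhappy, dead' > "$dir/q.txt"

# grep -c counts lines: "happy" occurs twice on line 1 but counts once,
# and "unhappy" on line 2 counts as a match too
lines=$(grep -c happy "$dir/q.txt")

# -o prints every match on its own line, -w requires whole words,
# -F -f reads the words as fixed strings from the list file
sum=$(grep -o -w -F -f "$dir/words.txt" "$dir/q.txt" | wc -l)
sum=$((sum))   # strip the padding some wc implementations add

echo "lines matching happy: $lines, total occurrences: $sum"
rm -rf "$dir"
```

Here grep -c reports 2 lines for happy, while the -o -w pipeline finds 4 whole-word occurrences across all four words in a single pass over q.txt.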

Upvotes: 0
