Reputation: 401

Match words in word-list and count occurrences

I have a plain text file with arbitrary writing in it, and a word list that I want to compare it against. I want to count the occurrences in the text file of each word that is on the word list.

For example, my word list might contain:

good
bad 
cupid
banana
apple

Then I want to compare each of these words with my text file, which may look like this:

Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.

I want the output to show how many times each of the listed words occurs. I have a way to do this in awk with a for-loop, but I would really like to avoid the for-loop: my real word list is about 10000 words long, so it would take forever.

So in this case my output should be (I think) 9, since that is the total number of occurrences of the words on the list.

By the way, the paragraph was totally random.

Upvotes: 4

Answers (4)

Hynek -Pichi- Vychodil

Reputation: 26121

For any bigger text I would definitely use this:

perl -nE'BEGIN{open my$fh,"<",shift;my@a=map lc,map/(\w+)/g,<$fh>;@h{@a}=(0)x@a;close$fh}exists$h{$_}and$h{$_}++for map lc,/(\w+)/g}{for(keys%h){say"$_: $h{$_}";$s+=$h{$_}}say"Total: $s"' word.list input.txt
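The one-liner is dense, so here is a minimal sketch of the same idea in awk, wrapped in a shell function: load the word list into a hash, split the text into lowercase words, and count only exact, whole-word hits in a single pass. The function name count_listed_words and the splitting on non-alphanumeric characters are assumptions of this sketch, not part of the Perl code above (Perl's \w also treats the underscore as a word character).

```shell
# Hypothetical helper: count_listed_words WORDLIST TEXT
# Prints "word: count" for every listed word, then "Total: N".
count_listed_words() {
    awk '
        # First file: remember every listed word (lowercased) with a zero count.
        FNR == NR {
            for (i = 1; i <= NF; i++) count[tolower($i)] = 0
            next
        }
        # Second file: split each line on non-alphanumeric characters and
        # bump the counter of every token that is on the list.
        {
            n = split(tolower($0), tok, /[^a-z0-9]+/)
            for (i = 1; i <= n; i++)
                if (tok[i] in count) count[tok[i]]++
        }
        END {
            for (w in count) { print w ": " count[w]; total += count[w] }
            print "Total: " (total + 0)
        }
    ' "$1" "$2"
}
```

Usage: count_listed_words word.list input.txt. Because only exact words are counted, apples and bananas in the question's sample are not hits, so this sketch reports a total of 7 rather than 9.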

Upvotes: 2

janos

Reputation: 124646

If you don't need the detailed report, then this is a faster version of @hek2mgl's answer:

while read word; do
    grep -o "$word" input.txt
done < words.txt | wc -l

If you do need the detailed report, here's another version:

while read word; do
    grep -o "$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

Finally, if you want to match full words only, then you need a stricter pattern in grep:

while read word; do
    grep -o "\<$word\>" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

However, this way the pattern banana will not match bananas in the text. If you want banana to match bananas, you could make the pattern match word beginnings like this:

while read word; do
    grep -o "\<$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

I'm not sure if it will be faster if we call grep with multiple words at the same time:

paste -d'|' - - - < words.txt | sed -e 's/ //g' -e 's/|*$//' | while read words; do
    grep -oE "\<($words)\>" input.txt
done

This will grep for 3 words at a time. You can try adding more - for paste to match more words at once, for example:

paste -d'|' - - - - - - - - - - < words.txt | ...

In any case, I'd like to know which solution is the fastest: this one or the awk solution by @HakonHægland.
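Taking the batching idea to its limit, grep can read the whole word list itself with -f, so the text is scanned once and no shell loop is needed. This is a sketch under two assumptions: the list holds plain words rather than regular expressions (hence -F), and stray whitespace, such as the space after bad in the question's list, is stripped first, because grep would otherwise treat it as part of the pattern. The function name count_with_grep is hypothetical.

```shell
# Hypothetical helper: count_with_grep WORDLIST TEXT
# Prints one "count word" line per matched word, then "total: N".
count_with_grep() {
    clean=$(mktemp) || return 1
    # Strip whitespace and drop empty lines so every pattern is a bare word.
    tr -d ' \t\r' < "$1" | grep -v '^$' > "$clean"
    # Match all patterns as fixed strings in a single pass over the text,
    # printing each match on its own line, then tally the matches.
    grep -oF -f "$clean" "$2" |
        sort | uniq -c |
        awk '{ total += $1; print } END { print "total:", total + 0 }'
    rm -f "$clean"
}
```

Usage: count_with_grep words.txt input.txt. Like the first loop above, this counts substring matches, so apple also matches apples. Adding -w to the grep options should restrict it to whole words, comparable to the \<$word\> pattern.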

Upvotes: 2

Håkon Hægland

Reputation: 40748

An Awk solution:

awk -f cnt.awk words.txt input.txt

where cnt.awk is:

FNR==NR {
    word[$1]=0
    next
}
{
    str=str $0 RS
}
END{
    for (i in word) {
        stri=str
        while(match(stri,i)) {
           stri=substr(stri,RSTART+RLENGTH)
           word[i]++
        }
    }
    for (i in word)
        print i, word[i]
}
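The inner match/substr loop can also be written with gsub, which returns the number of replacements it made. Replacing each match with itself (&) leaves the text unchanged and yields the count directly. This is a minimal sketch under the same assumptions as the script above: each listed word is used as a regular expression and matches substrings, so apple also counts apples. The function name count_with_gsub is hypothetical.

```shell
# Hypothetical helper: count_with_gsub WORDLIST TEXT
# Prints "word count" for every listed word (substring matches).
count_with_gsub() {
    awk '
        # First file: register each listed word, skipping blank lines.
        FNR == NR { if (NF) word[$1] = 0; next }
        # Second file: collect the whole text in one string.
        { str = str $0 RS }
        END {
            for (w in word) {
                # gsub returns how many non-overlapping matches it replaced;
                # "&" puts each match back, so str is left unchanged.
                word[w] = gsub(w, "&", str)
                print w, word[w]
            }
        }
    ' "$1" "$2"
}
```

Usage: count_with_gsub words.txt input.txt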

Upvotes: 2

hek2mgl

Reputation: 157967

For small to medium-sized texts you could use grep in combination with wc:

cat <<EOF > word.list
good
bad 
cupid
banana
apple
EOF

cat <<EOF > input.txt
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

while read search ; do
    echo "$search: $(grep -o "$search" input.txt | wc -l)"
done < word.list | awk '{total += $2; print}END{printf "total: %s\n", total}'

Output:

good: 3
bad: 2
cupid: 1
banana: 1
apple: 2
total: 9

Upvotes: 3
