Reputation: 401
So I have a text file with some arbitrary writing in it, and a word list I want to compare it against, counting the occurrences in the text file of each word that appears on the list.
For example, my word list might consist of this:
good
bad
cupid
banana
apple
Then I want to compare each of these individual words with my text file, which may look like this:
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
I wish my output to show how many times each of the listed words occurs. I have a way to do this with awk and a for-loop, but I really wish to avoid the for-loop, since it will take forever: my real word list is about 10,000 words long. So in this case my output should be (I think) 9, since it counts the total occurrences of the words on that list.
By the way, the paragraph was totally random.
Upvotes: 4
Views: 335
Reputation: 26121
For any bigger text I would definitely use this:
perl -nE'BEGIN{open my$fh,"<",shift;my@a=map lc,map/(\w+)/g,<$fh>;@h{@a}=(0)x@a;close$fh}exists$h{$_}and$h{$_}++for map lc,/(\w+)/g}{for(keys%h){say"$_: $h{$_}";$s+=$h{$_}}say"Total: $s"' word.list input.txt
Upvotes: 2
Reputation: 124646
If you don't need the detailed report, then this is a faster version of @hek2mgl's answer:
while read word; do
    grep -o "$word" input.txt
done < words.txt | wc -l
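As a side note, the per-word loop can often be avoided entirely: grep can read all the patterns from a file with -f, so the whole count becomes a single grep call. A minimal sketch using the question's sample data, assuming GNU grep and that the word list contains plain words rather than regexes:

```shell
# Recreate the question's sample data
printf '%s\n' good bad cupid banana apple > words.txt
cat > input.txt <<'EOF'
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

# -F: treat each line of words.txt as a fixed string, not a regex
# -o: print every match on its own line
# -f: read the patterns from a file
grep -oFf words.txt input.txt | wc -l   # prints 9
```

Like the unanchored loop above, this counts substring matches (apple also matches inside apples), which is exactly what makes the sample total come out to 9.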
If you do need the detailed report, here's another version:
while read word; do
    grep -o "$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'
Finally, if you want to match full words, then you need a stricter pattern in grep:
while read word; do
    grep -o "\<$word\>" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'
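With GNU grep, this whole-word loop can also be collapsed into one call: -w anchors every pattern at word boundaries, just like \<...\>. A sketch on the question's sample data:

```shell
printf '%s\n' good bad cupid banana apple > words.txt
cat > input.txt <<'EOF'
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

# -w: match whole words only; -F: fixed strings; -f: patterns from file
grep -owFf words.txt input.txt | sort | uniq -c
# 1 apple / 2 bad / 1 cupid / 3 good  (banana never matches bananas)
```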
However, this way the pattern banana will not match bananas in the text. If you want banana to match bananas, you could make the pattern match word beginnings like this:
while read word; do
    grep -o "\<$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'
I'm not sure if it will be faster to call grep with multiple words at the same time:
paste -d'|' - - - < words.txt | sed -e 's/ //g' -e 's/|*$//' | while read words; do
    grep -oE "\<($words)\>" input.txt
done
This will grep for 3 words at a time. You can try adding more - arguments to paste to match more words at once, for example:
paste -d'|' - - - - - - - - - - < words.txt | ...
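Taking the batching idea to its limit, the whole list can be joined into a single alternation up front, so grep runs exactly once. A sketch on the question's sample data, assuming GNU grep (for \< in ERE) and words free of regex metacharacters:

```shell
printf '%s\n' good bad cupid banana apple > words.txt
cat > input.txt <<'EOF'
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

# paste -s joins all lines of words.txt into one |-separated line
pattern=$(paste -sd'|' words.txt)            # good|bad|cupid|banana|apple
grep -oE "\<($pattern)" input.txt | wc -l    # prefix match, like \<$word above: prints 9
```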
In any case, I'd like to know which solution is the fastest, this or the awk solution by @HakonHægland.
Upvotes: 2
Reputation: 40748
An Awk solution:
awk -f cnt.awk words.txt input.txt
where cnt.awk is:
FNR==NR {
    word[$1] = 0
    next
}
{
    str = str $0 RS
}
END {
    for (i in word) {
        stri = str
        while (match(stri, i)) {
            stri = substr(stri, RSTART+RLENGTH)
            word[i]++
        }
    }
    for (i in word)
        print i, word[i]
}
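To try the script end-to-end with the question's sample data, it can be driven from the shell like this (a sketch; the output order of for (i in word) is unspecified, hence the sort):

```shell
# Write the counting script (same logic as the answer above)
cat > cnt.awk <<'EOF'
FNR==NR { word[$1]=0; next }           # first file: load the word list
{ str = str $0 RS }                    # second file: accumulate the text
END {
    for (i in word) {
        stri = str
        while (match(stri, i)) {       # count every (substring) occurrence
            stri = substr(stri, RSTART+RLENGTH)
            word[i]++
        }
    }
    for (i in word) print i, word[i]
}
EOF
printf '%s\n' good bad cupid banana apple > words.txt
cat > input.txt <<'EOF'
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

awk -f cnt.awk words.txt input.txt | sort
# apple 2 / bad 2 / banana 1 / cupid 1 / good 3
```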
Upvotes: 2
Reputation: 157967
For small to medium size texts you could use grep in combination with wc:
cat <<EOF > word.list
good
bad
cupid
banana
apple
EOF
cat <<EOF > input.txt
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF
while read search ; do
    echo "$search: $(grep -o "$search" input.txt | wc -l)"
done < word.list | awk '{total += $2; print}END{printf "total: %s\n", total}'
Output:
good: 3
bad: 2
cupid: 1
banana: 1
apple: 2
total: 9
Upvotes: 3