Reputation: 583
I need to write a command-line script in Linux that does the following:
read a list of words w_i from a text file (one word per line)
for each w_i, compute its word count in a different text file
sum these counts
Any help would be really appreciated!
Upvotes: 1
Views: 2186
Reputation: 85845
Here is a one-liner using awk that prints the word counts and the total:
awk 'NR==FNR{w[$1];next}{for(i=1;i<=NF;i++)if($i in w)w[$i]++}END{for(k in w){print k,w[k];s+=w[k]}print "Total",s}' file1 file2
hello 13
foo 20
world 13
baz
bar 20
Total 66
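Note: this uses Kent's example input. The count for baz is blank because the word never occurs in file2, so its array entry is never incremented.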
The more readable script version:
BEGIN {
    OFS="\t"                                # Space the output with a tab
}
NR==FNR {                                   # Only true in file1
    word_count[$1]                          # Build keys for all words
    next                                    # Get next line
}
{                                           # In file2 here
    for(i=1;i<=NF;i++)                      # For each word on the current line
        if($i in word_count)                # If the word has a key in the array
            word_count[$i]++                # Increment the count
}
END {                                       # After all files have been read
    for (word in word_count) {              # For each word in the array
        print word,int(word_count[word])    # Print the word and the count
        sum+=word_count[word]               # Sum the values
    }
    print "Total",sum                       # Print the total
}
Save it as script.awk and run it like this:
$ awk -f script.awk file1 file2
hello 13
foo 20
world 13
baz 0
bar 20
Total 66
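To try the NR==FNR idiom on a tiny input, here is a minimal sketch. The scratch files words and text and their contents are made up for illustration, and initializing each entry to 0 is a small variation so that words that never occur print 0 without needing int().

```shell
# Hypothetical scratch files, created only for this demonstration
tmp=$(mktemp -d)
printf 'foo\nbar\nbaz\n' > "$tmp/words"
printf 'foo bar foo\nqux foo\n' > "$tmp/text"

# NR counts records across all input files, FNR restarts at 1 for each file,
# so NR == FNR holds only while the first file (the word list) is being read.
counts=$(awk '
    NR == FNR { count[$1] = 0; next }     # word list: create a key with count 0
    { for (i = 1; i <= NF; i++)           # text file: check every field
          if ($i in count) count[$i]++ }
    END { for (word in count) print word, count[word] }
' "$tmp/words" "$tmp/text" | sort)        # sort, since for-in order is unspecified

# Sum the second column to get the total over all listed words
total=$(printf '%s\n' "$counts" | awk '{ sum += $2 } END { print sum }')

printf '%s\nTotal %s\n' "$counts" "$total"
rm -rf "$tmp"
```

One caveat with this idiom: if the first file is empty, NR==FNR stays true while the second file is read, so the word list should contain at least one line.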
Upvotes: 2
Reputation: 35960
Assuming the file named file contains one word per line and the text to search is in corpus, you can use the following command:
$ cat file | xargs -I% sh -c '{ echo "%\c"; grep -o "%" corpus | wc -l; }' | \
tee /dev/tty | awk '{ sum+=$2} END {print "Total " sum}'
For example, given file:
car
plane
bike
And corpus:
car is a plane is on a car
or in the car via a plane
plane plane
car
The output would be:
$ cat file | xargs -I% sh -c '{ echo "%\c"; grep -o "%" corpus | wc -l; }' | \
tee /dev/tty | awk '{ sum+=$2} END {print "Total " sum}'
car 4
plane 4
bike 0
Total 8
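The same per-word counting can also be sketched as a plain shell loop, which avoids spawning sh through xargs. This is only an illustrative variant, not the command above: it assumes a grep that supports -o and -w (GNU and BSD grep do), and it adds -w and -F so that only whole, literal words are counted. The scratch directory is hypothetical and reuses the example data.

```shell
# Hypothetical scratch files holding the example data from above
tmp=$(mktemp -d)
printf 'car\nplane\nbike\n' > "$tmp/file"
printf 'car is a plane is on a car\nor in the car via a plane\nplane plane\ncar\n' > "$tmp/corpus"

total=0
while IFS= read -r word; do
    # -o prints every match on its own line, -w restricts matches to whole words,
    # -F treats the word as a fixed string; wc -l then counts the matches.
    # tr strips the padding that some wc implementations add.
    n=$(grep -oFw -- "$word" "$tmp/corpus" | wc -l | tr -d ' ')
    printf '%s %s\n' "$word" "$n"
    total=$((total + n))
done < "$tmp/file"      # redirect, not a pipe, so total survives the loop

printf 'Total %s\n' "$total"
rm -rf "$tmp"
```

Because the count is kept in a shell variable, the output format no longer depends on how echo handles \c or on whether wc pads its result.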
Upvotes: 1
Reputation: 195179
This grep line may work for you; give it a try:
grep -oFwf wordlist textfile|wc -l
I just ran this small test, and it seems to work as you expected.
(PS: I inserted those words into file2 using vim, so I know how many I inserted.)
kent$ head file1 file2
==> file1 <==
foo
bar
baz
hello
world
==> file2 <==
foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar
hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world
blah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo ba
kent$ grep -oFwf file1 file2|wc -l
66
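If a per-word breakdown is wanted as well as the total, the output of grep -o can be grouped with sort and uniq -c. The following is a rough sketch on made-up scratch files (the names file1 and file2 mirror the test above, but the contents here are invented to keep it short).

```shell
# Hypothetical scratch files, created only for this demonstration
tmp=$(mktemp -d)
printf 'foo\nbar\nbaz\n' > "$tmp/file1"
printf 'foo bar foo\nqux baz\n' > "$tmp/file2"

# grep -o emits one matched word per line; sort groups identical words,
# uniq -c prefixes each group with its size, and awk swaps the columns.
breakdown=$(grep -oFwf "$tmp/file1" "$tmp/file2" | sort | uniq -c | awk '{ print $2, $1 }')

# The total is simply the number of matched words
total=$(grep -oFwf "$tmp/file1" "$tmp/file2" | wc -l | tr -d ' ')

printf '%s\nTotal %s\n' "$breakdown" "$total"
rm -rf "$tmp"
```

Note that words from the list that never occur in the text do not show up in this breakdown at all, since grep only prints what it actually matched.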
Upvotes: 2