showkey

Reputation: 358

How to count words in a file with grep or sed?

Here is a sample test file, rime.txt:

rime.txt

1. Count all words in the file.

wc -w rime.txt
4081 rime.txt
awk 'BEGIN{num=0}{split($0, A);n=length(A);num=num+n;}END{print num}'  rime.txt
4081

grep -Ec  '\w' rime.txt
672

Why is the total 672 with grep?
How can I count the words with sed?

2. Count words per line.

awk '{split($0, A);print length(A)}'  rime.txt

How can I do it with sed?

Upvotes: 0

Views: 3199

Answers (4)

James Brown

Reputation: 37404

If you want to use grep for the job, first form a regexp that resembles a word. I'll just use one or more characters from the set [a-zA-Z'-] and let you figure out a better one. Then use grep -o for matching:

   -o, --only-matching
          Print only the matched (non-empty) parts  of  a  matching  line,
          with each such part on a separate output line.

And finally count the matches with wc -l:

$ grep -oE "[a-zA-Z'-]+" rime.txt | wc -l
4090
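
As a minimal sketch of the same idea on a throwaway file (the sample text and the variable names are made up for illustration, not taken from rime.txt), the pattern is quoted so the shell does not trip over the apostrophe, and -E is assumed so that + means "one or more":

```shell
# Hypothetical sample: five words, one hyphenated, one with an apostrophe.
sample=$(mktemp)
printf '%s\n' "the quick-brown fox" "" "don't stop" > "$sample"

# -o prints every match on its own line, so wc -l counts matches, not input lines.
words=$(grep -oE "[a-zA-Z'-]+" "$sample" | wc -l)

# Arithmetic expansion strips the padding some wc implementations add.
echo "$((words))"   # prints 5

rm -f "$sample"
```

Note that this counts runs of the listed characters, which is not the same definition of "word" that wc -w uses (whitespace-separated tokens), so the two totals can differ on real text.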

Upvotes: 0

Ed Morton

Reputation: 203483

grep is counting lines, not words, and you would never use sed for this because sed is for simple substitutions on individual lines; that is all.

Also, those awk scripts are ridiculous. The correct way to write the first one would be awk '{num+=NF} END{print num+0}' or, with GNU awk, awk -v RS='[[:space:]]+' 'END{print NR+0}', and the second one is just awk '{print NF}'.
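
A small sketch of the two NF-based commands on a throwaway file (the sample content and the variable names are assumptions for illustration only):

```shell
# Hypothetical sample: 3 words, an empty line, then 2 words.
sample=$(mktemp)
printf '%s\n' "one two three" "" "four five" > "$sample"

# NF is the number of whitespace-separated fields on the current line.
per_line=$(awk '{print NF}' "$sample")

# Summing NF over all lines gives the total; num+0 prints 0 for an empty file.
total=$(awk '{num += NF} END {print num + 0}' "$sample")

printf '%s\n' "$per_line"   # prints 3, 0 and 2 on separate lines
echo "$total"               # prints 5

rm -f "$sample"
```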

Upvotes: 4

VIPIN KUMAR

Reputation: 3137

To clear up your doubt about the missing words, take a small example:

$cat ff
hello vipin
kumar
good night

Clearly, 3 lines with 5 words.

Try wc -w first:

$wc -w ff
5 ff  

And the grep command that you used:

$grep -Ec '\w' ff
3 

In your case, the total line count is:

$wc -l < file.txt 
833

Total blank line count:

$grep '^$' file.txt |wc -l
161

Total non-blank line count:

$grep -v '^$' file.txt |wc -l
672

That is why you are seeing 672 lines.

$echo $(expr 833 - 161)
672

As the expert has already mentioned, you shouldn't use sed for this operation, and grep -c '\w' gives you the line count, not the word count.
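
The same arithmetic can be sketched end to end on a small file. This is only an illustration: the file is the ff example with one blank line added, the variable names are made up, and \w is assumed to be supported by your grep (it is a GNU extension):

```shell
# Hypothetical sample: the ff example plus one blank line.
ff=$(mktemp)
printf '%s\n' "hello vipin" "" "kumar" "good night" > "$ff"

total=$(wc -l < "$ff")           # all lines
blank=$(grep -c '^$' "$ff")      # empty lines
matching=$(grep -c '\w' "$ff")   # lines containing at least one word character
words=$(wc -w < "$ff")           # actual word count

# $(( )) strips any padding that wc adds on some systems.
echo "lines=$((total)) blank=$((blank)) grep_c=$((matching)) words=$((words))"
# prints: lines=4 blank=1 grep_c=3 words=5

rm -f "$ff"
```

Here grep -c equals total minus blank (4 - 1 = 3), which is unrelated to the 5 words that wc -w reports.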

Upvotes: 1

M. Becerra

Reputation: 659

Because it's only counting lines, not words. From the man page:

-c, --count Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines.

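And as you can see on the link you provided, there are 834 lines and 672 SLOC (source lines of code, i.e. non-blank lines), and that last figure matches what grep -c '\w' reports, because it counts each line that contains at least one word character.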
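
To see the behaviour the man page describes, here is a rough sketch on a throwaway file (sample content and variable names are assumptions for illustration): a line with several matches still counts once under -c, and -v flips the count to the non-matching lines.

```shell
# Hypothetical sample: one line with three matches, one without, one empty.
sample=$(mktemp)
printf '%s\n' "a a a" "b" "" > "$sample"

match_lines=$(grep -c 'a' "$sample")          # lines containing "a"
match_total=$(grep -o 'a' "$sample" | wc -l)  # individual matches
other_lines=$(grep -vc 'a' "$sample")         # lines NOT containing "a"

echo "$((match_lines)) $((match_total)) $((other_lines))"   # prints: 1 3 2

rm -f "$sample"
```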

Upvotes: 1
