Reputation: 23
Need a command to count the unique words in a file using grep
I tried using grep along with sort and uniq, but I need a way that uses only the grep and wc commands. These are the two ways I have managed so far, but I need to do it using only grep:
$ grep -oE '\w+' 'file.txt' | sort | uniq | wc -l
$ grep -oE '\w+' 'file.txt' > temp.txt && awk '!seen[$0]++' temp.txt | wc -l
Sample input file:
one two three four five
two four one six
eight three seven five
Output: unique word count: 8
Is it possible to first extract the words using the grep -oE '\w+' file.txt command, then grep for each word in an initially empty file, and append the word to that file if grep does not find it there? That way only words that are not yet in the new file would get appended to it. Is it possible to do this using grep?
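The idea described above can be sketched as follows. This is a minimal sketch, not a grep-only solution: it still needs a shell loop, a temporary file, and output redirection in addition to grep and wc. The function name count_unique_words and the use of mktemp are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count unique words with grep + wc and a shell loop.
count_unique_words() {
    seen=$(mktemp) || return 1          # starts out empty
    grep -oE '\w+' "$1" | while IFS= read -r word; do
        # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
        # Append the word only if it is not already in the "seen" file.
        grep -qxF -e "$word" "$seen" || printf '%s\n' "$word" >> "$seen"
    done
    wc -l < "$seen"                     # one line per unique word
    rm -f "$seen"
}
```

Calling count_unique_words file.txt on the sample input should report 8. Because grep rescans the "seen" file for every word, this gets slow as the number of distinct words grows.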
Upvotes: 1
Views: 320
Reputation: 203149
What you want to do is impossible with just grep or grep+wc (unless you use GNU grep with its extensions and caveats per @jhnc's answer).
Given that, if you really just want to use one tool, then using GNU Awk for multi-char RS and assuming a file of space-separated "words" as input:
$ awk -v RS='\\s+' '{unq[$0]} END{print "unique word count:", length(unq)}' file.txt
unique word count: 8
or using your regexp for identifying a "word":
$ awk -v RS='\\w+' 'RT{unq[RT]} END{print "unique word count:", length(unq)}' file.txt
unique word count: 8
Upvotes: 1
Reputation: 16652
Since your grep has -o, I shall assume it also has -P and -z:
grep -zPo '(?s)(\b\w+\b)(?!.*\b\1\b)' file.txt |
grep -zc ^
-z to make grep treat the entire file as a single "line" (since there should be no nulls in it)
-P to enable Perl-compatible regular expressions (PCRE), which allow lookaround assertions
(?s) to tell PCRE that . should also match newlines
(?! ... ) to find the final occurrence of each word (i.e. a word not followed by anything followed by itself)
\b\w+\b and \b\1\b to exclude partial words
-o to output each match on its own "line" (because of -z, nulls are used as the line-ending character)
This will be very slow on larger files.
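To see what the first grep emits before it is counted, the null separators can be turned into newlines. This is only a sketch for inspection: it assumes GNU grep built with PCRE support, the function name show_last_occurrences is hypothetical, and tr is used purely for display.

```shell
#!/bin/sh
# Hypothetical helper: print the matches of the first grep, one per line.
# Assumes GNU grep with -P (PCRE) and -z support.
show_last_occurrences() {
    # Each match is the final occurrence of a word in the file;
    # tr converts the NUL terminators written by -z into newlines.
    grep -zPo '(?s)(\b\w+\b)(?!.*\b\1\b)' "$1" | tr '\0' '\n'
}
```

On the sample input this should print two, four, one, six, eight, three, seven, five, one per line. Those are the last occurrences of the 8 distinct words, which is what grep -zc ^ then counts.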
Upvotes: 3
Reputation: 19088
Since awk is also tagged, here is an approach using only (almost any) awk, returning the length of an associative array whose indices are the words.
% awk '{for(i=1;i<=NF;i++){A[$i]++}} END{print length(A)}' file
8
Tested with
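A variant that avoids calling length() on an array, since some older awks only accept a string there. This is a sketch under that assumption; the function name count_unique_awk and the variable names seen and n are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count whitespace-separated unique words with awk,
# incrementing a counter the first time each word is seen.
count_unique_awk() {
    awk '{ for (i = 1; i <= NF; i++) if (!seen[$i]++) n++ }
         END { print n + 0 }' "$1"     # n + 0 prints 0 for an empty file
}
```

count_unique_awk file should print 8 for the sample input. Like the one-liner above, it splits on whitespace, so a word with attached punctuation counts as a different word.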
Upvotes: 2