anonymous

Reputation: 23

Get the count of unique words in a file using grep and wc

I need a command that counts the unique words in a file using grep.

I tried using grep together with sort and uniq, but I need a way that uses only the grep and wc commands. These are the two ways I have managed so far, but I need to do it using only grep:

$ grep -oE '\w+' 'file.txt' | sort | uniq | wc -l
$ grep -oE '\w+' 'file.txt' > temp.txt && awk '!seen[$0]++' temp.txt | wc -l

Sample input file:

one two three four five
two four one six
eight three seven five

Output: unique word count: 8

Is it possible to first extract the words with grep -oE '\w+' file.txt, then grep for each word in an initially empty file and append the word to that file only if grep does not find it there? That way, only words not yet present in the new file would get appended to it. Can this be done using grep?
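
A minimal sketch of that idea, for illustration only. It needs a shell loop around grep, so it is not a grep-and-wc-only solution. The function name count_unique_words and the scratch file are hypothetical, and \w assumes a grep that supports it (such as GNU grep):

```shell
# Hypothetical helper: append each word to a scratch file only if it is
# not already there, then count the lines of the scratch file.
count_unique_words() {
    input=$1
    seen=$(mktemp) || return 1
    grep -oE '\w+' "$input" | while IFS= read -r word; do
        # -q: no output, -x: match the whole line, -F: literal string
        grep -qxF -- "$word" "$seen" || printf '%s\n' "$word" >> "$seen"
    done
    wc -l < "$seen"
    rm -f "$seen"
}
```

Calling count_unique_words file.txt prints the number of distinct words. It starts one grep process per word, so it would likely be slow on anything but small files.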

Upvotes: 1

Views: 320

Answers (3)

Ed Morton

Reputation: 203149

What you want to do is impossible with just grep or grep+wc (unless you use GNU grep with its extensions and caveats per @jhnc's answer).

Given that, if you really just want to use one tool, you can use GNU Awk (needed for the multi-char RS), assuming a file of whitespace-separated "words" as input:

$ awk -v RS='\\s+' '{unq[$0]} END{print "unique word count:", length(unq)}' file.txt
unique word count: 8

or using your regexp for identifying a "word":

$ awk -v RS='\\w+' 'RT{unq[RT]} END{print "unique word count:", length(unq)}' file.txt
unique word count: 8

Upvotes: 1

jhnc

Reputation: 16652

Since your grep has -o I shall assume it also has -P and -z:

grep -zPo '(?s)(\b\w+\b)(?!.*\b\1\b)' file.txt |
grep -zc ^
  • use -z to make grep treat the entire file as a single "line" (since there should be no nulls in it)
  • use -P to enable Perl-compatible regular expressions (PCRE) which allow lookaround assertions
  • (?s) - tell PCRE that . should also match newlines
  • use a negative lookahead (?! ... ) to find the final occurrence of each word (i.e. a word that is not followed, anywhere later in the input, by another copy of itself)
    • \b\w+\b and \b\1\b exclude partial words
  • we use a lookahead so that the lookahead text is not consumed by the match and can be reused when looking for more final words
  • use -o to output each match on its own "line" (because of -z, nulls are used as the line ending character)
  • the second grep (grep -zc ^) takes the generated list of unique words and outputs the count of null-terminated "lines"

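This will be very slow on larger files: for every word, the lookahead may have to scan the rest of the file, and -z makes grep hold the whole file in memory as one "line".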
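
To see what the first grep emits, a small check on the sample input from the question might look like the following. It assumes GNU grep built with PCRE support; the file name sample.txt and the variable unique_count are only illustrative:

```shell
# Recreate the question's sample input (sample.txt is an illustrative name).
printf '%s\n' 'one two three four five' \
              'two four one six' \
              'eight three seven five' > sample.txt

# Show the NUL-separated matches one per line: each word is reported once,
# at the position of its final occurrence in the file.
grep -zPo '(?s)(\b\w+\b)(?!.*\b\1\b)' sample.txt | tr '\0' '\n'

# Count the NUL-terminated "lines" produced by the first grep.
unique_count=$(grep -zPo '(?s)(\b\w+\b)(?!.*\b\1\b)' sample.txt | grep -zc ^)
echo "unique word count: $unique_count"
```

The tr step is only there to make the null-separated output readable; the count itself comes from the second grep, as in the command above.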

Upvotes: 3

Andre Wildberg

Reputation: 19088

Since awk is also tagged, here is an approach using only (almost any) awk. It prints the length of an associative array whose indices are the words.

% awk '{for(i=1;i<=NF;i++){A[$i]++}} END{print length(A)}' file
8

Tested with

  • GNU awk 3.1.8/4.2.1/5.3.0
  • nawk 20221215
  • original awk 20121220
  • mawk 20240123

Upvotes: 2
