Reputation: 19

Counting number of different words in a txt file in Bash

I don't know much about Bash programming. I'm new to it, so I'm struggling to find code that iterates over all the lines in a txt file and counts how many different words there are. Example: if a txt file contains "Nory was a Catholic because her mother was a Catholic", the result must be 7.

Upvotes: 0

Answers (5)

pooley1994

Reputation: 973

Sure. I assume you are OK with defining "words" as things that are separated by spaces? In that case, try something like this:

cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u | wc -l

This command says:

Dump contents of filename
Replace multiple spaces with a single space
Replace spaces with newline
Sort and "uniquify" the list
Print out the count of lines

Per the comment, you can technically get away without using cat if you'd like, with the following:

sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" filename | sort -u | wc -l

Further, from another comment, you could optionally use tr (importantly with its -s flag to handle repeated spaces) instead of sed, with something like:

tr -s " " "\n" < filename | sort -u | wc -l

The moral of the story is that there are several ways this kind of thing can be accomplished, not to mention the other full answers given here :-) My personal favorite answer at this point is Ed Morton's, which I've upvoted accordingly.
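As a minimal sketch, the tr variant above can be wrapped in a small function and tried on the sentence from the question. The function name count_distinct_words and the temporary sample file are assumptions made for illustration, not part of the original answer.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count distinct space-separated words in the file "$1".
count_distinct_words() {
  # tr -s turns each run of spaces into one newline (one word per line),
  # sort -u keeps a single copy of each word, wc -l counts what is left.
  tr -s ' ' '\n' < "$1" | sort -u | wc -l
}

# Temporary sample file holding the sentence from the question.
sample=$(mktemp)
printf '%s\n' 'Nory was a Catholic because her mother was a Catholic' > "$sample"
count_distinct_words "$sample"   # 7 distinct words
rm -f "$sample"
```

Note that a line starting with a space would produce an empty "word" with this approach, so the sketch assumes lines without leading spaces.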

Upvotes: 2

Léa Gris

Reputation: 19555

You could also lowercase the text so that words compare equal regardless of case.

Also, filter words with the [:alnum:] character class rather than [a-zA-Z0-9_], which is only valid for US-ASCII and will fail dramatically with Greek or Turkish.

#!/usr/bin/env bash
echo "The uniq words are the words that appears at least once, regardless of casing." |
  # Turn text to lowercase
  tr '[:upper:]' '[:lower:]' |
  # Split alphanumeric with newlines
  tr -sc '[:alnum:]' '\n' |
  # Sort uniq words
  sort -u |
  # Count lines of unique words
  wc -l
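To see what the lowercasing step changes, here is a small sketch that runs the same pipeline with and without it. The two function names and the sample text are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helpers; both read text on standard input.
count_words_case_sensitive() {
  tr -sc '[:alnum:]' '\n' | sort -u | wc -l
}

count_words_ignoring_case() {
  # Lowercase first, so "The", "the" and "THE" collapse into one word.
  tr '[:upper:]' '[:lower:]' | tr -sc '[:alnum:]' '\n' | sort -u | wc -l
}

text='The cat saw the CAT'
# Distinct words here: The, cat, saw, the, CAT
printf '%s\n' "$text" | count_words_case_sensitive
# Distinct words here: the, cat, saw
printf '%s\n' "$text" | count_words_ignoring_case
```

Whether case should matter depends on what "different words" means for your data, so treat the lowercasing as an optional step.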

Upvotes: 1

Ed Morton

Reputation: 203615

$ grep -o '[^[:space:]]*' file | sort -u | wc -l
7
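A short sketch of why matching runs of non-whitespace can differ from splitting on spaces only: tabs also separate words here. It assumes GNU grep, and the function name and sample input are made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count distinct runs of non-whitespace characters in "$1".
count_non_space_runs() {
  # grep -o prints each non-empty match on its own line.
  grep -o '[^[:space:]]*' "$1" | sort -u | wc -l
}

sample=$(mktemp)
# The first two words are separated by a tab, the rest by spaces.
printf 'a\tb a b\n' > "$sample"
# Splitting on spaces only keeps "a<TAB>b" as one word: 3 distinct entries.
tr -s ' ' '\n' < "$sample" | sort -u | wc -l
# Treating any whitespace as a separator finds just "a" and "b": 2 entries.
count_non_space_runs "$sample"
rm -f "$sample"
```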

Upvotes: 4

KamilCuk

Reputation: 141060

I would do it like so, with comments:

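echo "Nory was a Catholic because her mother was a Catholic" |
# tr translates characters:
# -s squeezes repeated output characters into one
# -c uses the complement of the given set
# a-zA-Z0-9_ is all letters, digits and underscore (written without
# surrounding brackets, which tr would treat as literal characters),
# so the complement is everything that is not a letter, digit or underscore;
# each such character is replaced by a newline
tr -sc 'a-zA-Z0-9_' '\n' |
# then sort, keep unique lines, and count them
sort -u | wc -l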

Tested on repl bash.

I decided to use [a-zA-Z0-9_] because that is what the GNU sed \w extension matches as a word character.
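The practical effect of splitting on non-word characters shows up once the text contains punctuation. The following is only a sketch: the function name count_word_tokens and the sample sentence are assumptions, and the set is written without brackets because tr treats [ and ] as ordinary characters.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count distinct words made of letters, digits and underscores.
count_word_tokens() {
  tr -sc 'a-zA-Z0-9_' '\n' | sort -u | wc -l
}

text='A Catholic, because her mother was a Catholic.'
# Splitting on spaces keeps "Catholic," and "Catholic." as two different words: 8.
printf '%s\n' "$text" | tr -s ' ' '\n' | sort -u | wc -l
# Splitting on non-word characters reduces both to "Catholic": 7.
printf '%s\n' "$text" | count_word_tokens
```

If the input can start with punctuation, the first output line of tr is empty and gets counted as a word, so a filter for empty lines may be needed in that case.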

Upvotes: 0

Travis G

Reputation: 24

cat yourfile.txt | xargs -n1 | sort | uniq -c > youroutputfile.txt

xargs -n1 = put one word per line

sort = sorts

uniq -c = counts occurrences of distinct values
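Note that the command above writes a table of per-word occurrence counts rather than a single number. Since uniq -c prints one line per distinct word, counting those lines gives the figure the question asks for. A sketch, with a hypothetical function name, assuming the text contains no quotes or backslashes (which xargs would try to interpret):

```shell
#!/usr/bin/env bash

# Hypothetical helper: number of distinct words in the file "$1".
distinct_from_frequency_table() {
  # xargs -n1 echoes one word per line, uniq -c emits one line per distinct word,
  # and wc -l counts those lines.
  xargs -n1 < "$1" | sort | uniq -c | wc -l
}

sample=$(mktemp)
printf '%s\n' 'Nory was a Catholic because her mother was a Catholic' > "$sample"
distinct_from_frequency_table "$sample"   # 7 distinct words
rm -f "$sample"
```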

source

Upvotes: -1
