Leonil Sulude

Reputation: 19

Counting number of different words in a txt file in Bash

Well, I do not know much about programming in Bash; I'm new to it, so I'm struggling to find a way to iterate over all the lines of a txt file and count how many different words there are. Example: if a txt file has "Nory was a Catholic because her mother was a Catholic",
then the result must be 7.

Upvotes: 0

Views: 985

Answers (5)

pooley1994

Reputation: 973

Sure. I assume you are OK with defining "words" as things that are separated by spaces? In that case, try something like this:

cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u | wc -l

This command says (a worked run on the sample sentence follows the list):

  • Dump contents of filename

  • Replace multiple spaces with a single space

  • Replace spaces with newline

  • Sort and "uniquify" the list

  • Print out the count of lines
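
For instance, if filename contains the question's sample sentence, the intermediate and final results would look roughly like this (the exact sort order depends on your locale):

$ cat filename
Nory was a Catholic because her mother was a Catholic
$ cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u
Catholic
Nory
a
because
her
mother
was
$ cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u | wc -l
7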

Per the comment, you can technically get away without using cat if you'd like, with the following:

sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" filename | sort -u | wc -l

Further, from another comment, you could optionally use tr (importantly, with its -s flag to handle repeated spaces) instead of sed, with something like:

tr -s " " "\n" < filename | sort -u | wc -l

The moral of the story is that there are several ways this kind of thing can be accomplished, not to mention the other full answers given here :-) My personal favorite at this point is Ed Morton's, which I've upvoted accordingly.

Upvotes: 2

Léa Gris

Reputation: 19555

You could also lowercase the text so words are compared regardless of casing.

Also, filter words with the [:alnum:] character class rather than [a-zA-Z0-9_], which is only valid for US-ASCII and will fail dramatically with Greek or Turkish text.

#!/usr/bin/env bash
echo "The uniq words are the words that appears at least once, regardless of casing." |
  # Turn text to lowercase
  tr '[:upper:]' '[:lower:]' |
  # Replace runs of non-alphanumeric characters with newlines (one word per line)
  tr -sc '[:alnum:]' '\n' |
  # Sort and keep only unique words
  sort -u |
  # Count lines of unique words
  wc -l
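
Applied to the question's sample sentence, the same pipeline should likewise print 7 (a quick sketch reusing the commands above; lowercasing doesn't merge any extra words here, but it would for input containing both "Catholic" and "catholic"):

#!/usr/bin/env bash
echo "Nory was a Catholic because her mother was a Catholic" |
  tr '[:upper:]' '[:lower:]' |
  tr -sc '[:alnum:]' '\n' |
  sort -u |
  wc -l
# prints 7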

Upvotes: 1

Ed Morton

Reputation: 203615

$ grep -o '[^[:space:]]*' file | sort -u | wc -l
7
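
For context: grep -o prints every maximal run of non-whitespace characters on its own line, and sort -u then deduplicates that list before wc -l counts it. Assuming file holds the sample sentence, the intermediate list would look something like this (order depends on your locale):

$ grep -o '[^[:space:]]*' file | sort -u
Catholic
Nory
a
because
her
mother
was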

Upvotes: 4

KamilCuk

Reputation: 141060

I would do it like so, with comments:

echo "Nory was a Catholic because her mother was a Catholic" |
# tr translates characters:
# -s - squeeze repeated output characters
# -c - use the complement of the first set
# [a-zA-Z0-9_] - all letters, digits and the underscore,
# but complemented, so everything that is not a letter, digit or underscore
# is replaced with a newline
tr -sc '[a-zA-Z0-9_]' '\n' |
# and sort unique and display count
sort -u | wc -l

Tested on repl bash.

I decided to use [a-zA-Z0-9_] because this is how the GNU sed \w extension matches a word.
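
For comparison, a rough sed-based variant of the same idea, using GNU sed's \W extension to replace runs of non-word characters (anything outside [a-zA-Z0-9_]) with newlines, might look like this (a sketch; GNU sed assumed, and a line starting with punctuation would add one empty entry to the count):

echo "Nory was a Catholic because her mother was a Catholic" |
# replace runs of non-word characters with newlines
sed -E 's/\W+/\n/g' |
# sort unique and count, as above
sort -u | wc -l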

Upvotes: 0

Travis G

Reputation: 24

cat yourfile.txt | xargs -n1 | sort | uniq -c > youroutputfile.txt

xargs -n1 = put one word per line

sort = sorts

uniq -c = counts occurrences of distinct values

source
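
Note that this prints one line per distinct word with its occurrence count, rather than the single number the question asks for; on the sample sentence the output would look roughly like this, and appending | wc -l to the pipeline would give 7:

$ echo "Nory was a Catholic because her mother was a Catholic" | xargs -n1 | sort | uniq -c
      2 Catholic
      1 Nory
      2 a
      1 because
      1 her
      1 mother
      2 was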

Upvotes: -1
