Extract all the words from a text file in bash

Question

I need to read all the words from a file into a variable. In addition, I need to store each word only once. The matching is not case sensitive, so "Hello", "hello", "hElLo" and "HELLO" count as the same word. If a word has an apostrophe, like "it's", the "'s" must be ignored and only "it" counted as a word.

To do that I used the following command:

#Stores the words of the file without duplicates
WORDS=`grep -o -E '\w+' $1 | sort -u -f`

The first two criteria are met, but this method counts a word like "it's" as two separate words, "it" and "s".
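
The problem can be reproduced with a single made-up input line (the sample text is only an illustration, not taken from the real file):

```bash
# Hypothetical sample input. \w+ does not match the apostrophe,
# so "it's" is split into the two tokens "it" and "s".
TOKENS=$(printf "it's here\n" | grep -o -E '\w+')
printf '%s\n' "$TOKENS"
```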

Arnaud Valmary · Accepted Answer

Maybe something like this:

WORDS=$(grep -o -E "(\w|')+" words.txt | sed -e "s/'.*\$//" | sort -u -f)

UPDATE

Explanations:

var=$(...command...) : Runs the command and stores its standard output in the variable var. This is the newer, preferred form of `...command...`.
grep -o -E "(\w|')+" words.txt : Reads the file words.txt and applies the grep filter.
- The filter prints only the matched tokens (-o), using the extended (-E) regular expression (\w|')+. The expression matches a word character (\w, a synonym for [_[:alnum:]]; alnum means alphanumeric characters such as [0-9a-zA-Z] in English, extended to many other characters for other languages) or (|) a single quote ('), one or more times (+). See man grep.
The standard output of grep becomes the standard input of the next command, sed, through the pipe (|).
sed -e "s/'.*\$//" : Executes (-e) the expression s/'.*$// (inside double quotes, the backslash only keeps the shell from interpreting the $).
- The sed expression substitutes (s/) '.*$ (a single quote followed by zero or more characters up to the end of the line) with the empty string (between the last two slashes, //). See man sed.
The standard output of sed becomes the standard input of the next command, sort, through the pipe (|).
sort -u -f : Sorts the output of sed, removes duplicates (-u, unique) and ignores the difference between upper and lower case (-f). See man sort.
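
To see what each stage contributes, the pipeline can be run step by step on a small sample. This is only a sketch: the temporary file and its contents are made up for illustration and stand in for words.txt, and LC_ALL=C is set only to make the sort order predictable.

```bash
# Hypothetical sample input standing in for words.txt.
tmpfile=$(mktemp)
printf '%s\n' "Hello hello HELLO" "it's its It" > "$tmpfile"

# Stage 1: one token per line, apostrophes kept
# (Hello, hello, HELLO, it's, its, It).
grep -o -E "(\w|')+" "$tmpfile"

# Stage 2: everything from the first apostrophe onward is cut,
# so "it's" becomes "it".
grep -o -E "(\w|')+" "$tmpfile" | sed -e "s/'.*\$//"

# Full pipeline: case-insensitive sort with duplicates removed.
WORDS=$(grep -o -E "(\w|')+" "$tmpfile" | sed -e "s/'.*\$//" | LC_ALL=C sort -u -f)
printf '%s\n' "$WORDS"

rm -f "$tmpfile"
```

The variable ends up with three entries: one spelling of "hello", plus "it" and "its". Which capitalization of a duplicate is kept is up to sort. One limitation to be aware of: a token that begins with an apostrophe (for example 'tis, or a word wrapped in single quotes) is reduced to an empty line by the sed step; appending sed '/^$/d' before sort would at least drop those empty lines.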
