Reputation: 113
I need to read all the words from a file into a variable, storing each word only once. The comparison must not be case sensitive, so "Hello", "hello", "hElLo" and "HELLO" count as the same word. If a word has an apostrophe, like "it's", the "'s" must be ignored and only "it" counted as a word.
To do that I used the following command:
#Stores the words of the file without duplicates
WORDS=`grep -o -E '\w+' $1 | sort -u -f`
The first two criteria are met but this method counts words like "it's" as two separate words "it" and "s".
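The split can be reproduced on a single word. This is a minimal sketch, not the exact command above: it feeds the word through a pipe instead of a file, and writes \w out as the bracket expression [[:alnum:]_] so that it does not depend on a grep that supports \w.

```shell
# Feed one word with an apostrophe to the same kind of pattern.
# [[:alnum:]_] stands in for \w (assumption: used here for portability).
tokens=$(printf '%s\n' "it's" | grep -o -E '[[:alnum:]_]+')

# The apostrophe is not a word character, so grep reports two matches,
# one per line: "it" and "s".
printf '%s\n' "$tokens"
```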
Upvotes: 3
Views: 2088
Reputation: 2327
Maybe something like this:
WORDS=$(grep -o -E "(\w|')+" words.txt | sed -e "s/'.*\$//" | sort -u -f)
UPDATE
Explanations:
var=$(...command...) : Executes the command and stores its standard output in the variable var. This form is newer and preferable to `...command...`.
grep -o -E "(\w|')+" words.txt : Reads the file words.txt and applies the grep filter. The filter prints only the matched tokens (-o) of the extended (-E) regular expression (\w|')+. This expression matches word characters (\w, a synonym of [_[:alnum:]]; alnum means alphanumeric characters, i.e. [0-9a-zA-Z] for English, extended to many other characters for other languages) or (|) a single quote ('), one or more times (+). See man grep.
The standard output of grep is the standard input of the next command, sed, through the pipe (|).
sed -e "s/'.*\$//" : Executes (-e) the expression s/'.*\$//. This sed expression substitutes (s/) the pattern '.*\$ (a single quote followed by zero or more characters up to the end of the line; the backslash only protects the $ inside the shell's double quotes) with the empty string (between the last two slashes, //). See man sed.
The standard output of sed is the standard input of the next command, sort, through the pipe (|).
sort -u -f : Sorts the lines, removes duplicates (-u: unique) and makes no distinction between upper- and lower-case characters (-f). See man sort.
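To see the three stages work together, here is a minimal sketch under a few assumptions: unique_words is a hypothetical helper name, the sample text is made up, and \w is written out as [[:alnum:]_] (the synonym mentioned above) so the pattern does not rely on a grep that supports \w.

```shell
#!/bin/sh
# unique_words FILE (hypothetical helper): print each word of FILE once,
# ignoring case and dropping everything from the first apostrophe on.
unique_words() {
    grep -o -E "[[:alnum:]_']+" "$1" |   # one token per line, apostrophes kept
        sed -e "s/'.*\$//" |             # cut the apostrophe and what follows
        sort -u -f                       # sort, fold case, drop duplicates
}

# Made-up sample input, written to a temporary file.
sample=$(mktemp)
printf '%s\n' "Hello hello hElLo HELLO" "it's a test" "It is a Test" > "$sample"

WORDS=$(unique_words "$sample")
printf '%s\n' "$WORDS"

rm -f "$sample"
```

With this input the result has five lines, one each for "a", "hello", "is", "it" and "test"; which spelling of a repeated word survives (for example "Hello" or "HELLO") is left to sort. A token that begins with an apostrophe is reduced to an empty line by sed, so appending something like sed '/^$/d' before sort may be worth considering if the input can contain such tokens.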
Upvotes: 2