Reputation: 10535
Suppose I have a file text.txt as below:
she likes cats, and he likes cats too.
I'd like my result to look like:
she 1
likes 2
cats 2
and 1
he 1
too 1
If putting spaces around the , and . in the file would make the script easier, that would be fine.
Is there a simple shell pipeline that could achieve this?
Upvotes: 5
Views: 7571
Reputation: 203493
With GNU awk you can just specify the Record Separator (RS) to be any sequence of non-alphabetic characters:
$ gawk -v RS='[^[:alpha:]]+' '{sum[$0]++} END{for (word in sum) print word,sum[word]}' file
she 1
likes 2
and 1
too 1
he 1
cats 2
but that won't solve your problem of how to identify "words" in general.
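If you want the words printed in a stable order, note that for (word in sum) visits array elements in an unspecified order by default. A sketch of one way around that, assuming GNU awk 4.0 or later, where PROCINFO["sorted_in"] selects the traversal order:
$ gawk -v RS='[^[:alpha:]]+' '{sum[$0]++} END{PROCINFO["sorted_in"]="@ind_str_asc"; for (word in sum) print word, sum[word]}' file
Here "@ind_str_asc" sorts by the array index (the word itself) in ascending string order.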
Upvotes: 0
Reputation: 11051
Here's a one-liner near and dear to my heart:
cat text.txt | sed 's|[,.]||g' | tr ' ' '\n' | sort | uniq -c
The sed strips the punctuation (tune the regex to taste), the tr puts the results one word per line, and sort | uniq -c does the counting.
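Note that uniq -c prints each count before its word (e.g. "2 cats" rather than "cats 2"). If you want the word first, as in the question, a sketch of one variant, assuming GNU coreutils and awk are available (the awk at the end just swaps the two columns):
sed 's|[,.]||g' text.txt | tr -s ' ' '\n' | sort | uniq -c | awk '{print $2, $1}'
The -s on tr squeezes runs of blanks so repeated spaces don't get counted as empty words.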
Upvotes: 20