Reputation: 24423
I have a file like this:
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
I would like to generate a two-column list. The first column shows which words appear, and the second column shows how often they appear, for example:
this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1
words and word can count as two separate words.
So far, I have this:
sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
count="$(grep -c $line file1.txt)"
echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines
For some reason, this is only showing "0" after each word.
How can I generate a list of every word that appears in a file, along with frequency information?
Upvotes: 57
Views: 84765
Reputation: 81
grep -Eio "\w+" test.txt | sort | uniq -c | sort -nr
-E: extended regular expression
-i: ignore upper/lower case
-o: only output the matched parts
"\w": [a-zA-Z0-9_]
+: repeat the preceding character 1 or more times
sort: sort the words alphabetically
uniq -c: count the occurrences of each unique word
sort -nr: sort by word frequency, highest first
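To get the word@count format asked for in the question, one possible tweak (my own addition, not part of this answer) is to append an awk step that swaps the two columns:
grep -Eio "\w+" test.txt | sort | uniq -c | sort -nr | awk '{print $2"@"$1}'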
Upvotes: 2
Reputation: 40232
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately) and suppress an entry for a zero-length word. For ASCII text you can do all of this with this modified command:
sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
Upvotes: 84
Reputation: 11647
This is a slightly more complex task. We need to take at least the following into account:
$ file the-king-james-bible.txt
the-king-james-bible.txt: UTF-8 Unicode (with BOM) text
The BOM is the very first character in the file. If it is not removed, it is counted as part of the first word and skews that word's count.
The following is a solution with AWK.
{
    if (NR == 1) {
        sub(/^\xef\xbb\xbf/,"")
    }

    gsub(/[,;!()*:?.]*/, "")

    for (i = 1; i <= NF; i++) {
        if ($i ~ /^[0-9]/) {
            continue
        }
        w = $i
        words[w]++
    }
}

END {
    for (idx in words) {
        print idx, words[idx]
    }
}
It removes the BOM character and strips punctuation characters. It does not lowercase the words. In addition, since the program was used to count the words of the Bible, it skips all verse numbers (the if condition with continue, which skips any field that starts with a digit).
$ awk -f word_freq.awk the-king-james-bible.txt > bible_words.txt
We run the program and write the output into a file.
$ sort -nr -k 2 bible_words.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942
With sort and head, we find the ten most frequent words in the Bible.
Upvotes: 1
Reputation:
If I have the following text in my file.txt:
This is line number one
This is Line Number Tow
this is Line Number tow
I can find the frequency of each word using the following command:
cat file.txt | tr ' ' '\n' | sort | uniq -c
Output:
3 is
1 line
2 Line
1 number
2 Number
1 one
1 this
2 This
1 tow
1 Tow
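If case should not matter, a variation of the same command (my own addition, not part of this answer) lowercases the text first so that This and this are counted together:
cat file.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | sort | uniq -c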
Upvotes: 3
Reputation: 109
awk '
BEGIN { word[""] = 0 }
{
    for (el = 1; el <= NF; ++el) {
        word[$el]++
    }
}
END {
    for (i in word) {
        if (i != "") {
            print word[i], i
        }
    }
}' file.txt | sort -nr
Upvotes: -1
Reputation: 8712
You can use tr for this; just run
tr ' ' '\12' <NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt
('\12' is the octal escape for a newline character, so this is equivalent to tr ' ' '\n'.)
Sample Output for a text file of city names:
3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
Upvotes: 14
Reputation: 6694
#!/usr/bin/env bash
declare -A map
words="$1"

[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1 ;}

while read -r line; do
    for word in $line; do
        ((map[$word]++))
    done
done < "$words"

for key in "${!map[@]}"; do
    echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5
Upvotes: 0
Reputation: 717
Let's do it in Python 3!
"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])

D = {}
for line in lines:
    for word in line.split():
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        if word in D:
            D[word] += 1
        else:
            D[word] = 1

for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))
Let's name this script "frequency.py" and add a line to "~/.bash_aliases":
alias freq="python3 /path/to/frequency.py"
Now to find the frequency words in your file "content.txt", you do:
freq content.txt
You can also pipe output to it:
cat content.txt | freq
And even analyze text from multiple files:
cat content.txt story.txt article.txt | freq
If you are using Python 2, just replace:
''.join(list(filter(args...))) with filter(args...)
python3 with python
print(whatever) with print whatever
Upvotes: 2
Reputation: 75820
This function lists the frequency of each word occurring in the provided input, in descending order:
function wordfrequency() {
  awk '
  BEGIN { FS="[^a-zA-Z]+" } {
    for (i=1; i<=NF; i++) {
      word = tolower($i)
      words[word]++
    }
  }
  END {
    for (w in words)
      printf("%3d %s\n", words[w], w)
  } ' | sort -rn
}
You can call it on your file like this:
$ cat your_file.txt | wordfrequency
Source: AWK-ward Ruby
Upvotes: 7
Reputation: 17213
uniq -c already does what you want; just sort the input first:
echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c
output:
6 a
7 d
7 s
Upvotes: 49
Reputation: 1734
Content of the input file
$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
Using sed | sort | uniq
$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
1 a
2 appear
1 file
1 is
1 many
1 more
2 of
1 once
1 one
1 only
2 some
1 than
2 the
1 this
1 time
1 with
3 words
uniq -ic will count and ignore case, but the result list will have This instead of this.
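A quick illustration of that behaviour (my own example, assuming GNU uniq, which supports -i; the three input lines are already adjacent, so no sort is needed, and the count spacing may vary):
$ printf 'This\nthis\nThis\n' | uniq -ic
      3 This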
Upvotes: 7
Reputation: 360335
The sort requires GNU AWK (gawk). If you have another AWK without asort(), this can be easily adjusted and then piped to sort.
awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile
Broken out onto multiple lines:
awk '{
    gsub(/\./, "");
    for (i = 1; i <= NF; i++) {
        w = tolower($i);
        count[w]++;
        words[w] = w
    }
}
END {
    qty = asort(words);
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile
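A rough sketch of that adjustment (my own variant, not the answerer's code): drop asort() and pipe the output to an external sort instead, which works in any POSIX awk:
awk '{
    gsub(/\./, "")
    for (i = 1; i <= NF; i++)
        count[tolower($i)]++
}
END {
    for (w in count)
        print w "@" count[w]
}' inputfile | sort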
Upvotes: 1
Reputation: 58473
This might work for you:
tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' |
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'
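Run against the sample text from the question, this pipeline should produce output along these lines (my own check, assuming GNU tools):
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once@1
one@1
only@1
some@2
than@1
the@2
this@1
time@1
with@1
words@3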
Upvotes: 4