Village

Reputation: 24423

How to create a frequency list of every word in a file?

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I would like to generate a two-column list: the first column shows which words appear, and the second column shows how often they appear. For example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1 

So far, I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this is only showing "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

Upvotes: 57

Views: 84765

Answers (13)

Frank Xu

Reputation: 81

grep -Eio "\w+" test.txt | sort | uniq -c | sort -nr

-E: use extended regular expressions
-i: ignore upper/lower case
-o: output only the matched text, one match per line

"\w": matches [a-zA-Z0-9_]
+: repeat the preceding item one or more times
sort: sort the words alphabetically so that duplicates become adjacent
uniq -c: collapse adjacent duplicates, prefixing each word with its count
sort -nr: sort numerically by count, highest first

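On the question's sample text saved as test.txt, this gives output along these lines (exact ordering within equal counts may vary, and Some keeps its capitalization because -i only affects matching, not the output):

      3 words
      2 the
      2 of
      2 appear
      2 Some

followed by one line for each of the remaining words, which each appear once.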

Upvotes: 2

eduffy

Reputation: 40232

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress empty entries. For ASCII text you can do all of this with the following modified command:

sed -e  's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$'| sort | uniq -c | sort -rn
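Against the question's sample text (saved as text.txt), this pipeline produces something like the following; the ordering within equal counts may vary:

      3 words
      2 the
      2 some
      2 of
      2 appear

followed by one line for each of the remaining words, which each appear once.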

Upvotes: 84

Jan Bodnar

Reputation: 11647

This is a slightly more complex task. We need to take at least the following into account:

  • removing punctuation; sky is different from sky. or sky?
  • Earth is different from earth, god from God, moon from Moon, but The and the are considered the same. So it is questionable whether to lowercase the words or not.
  • we must take the BOM character into account
$ file the-king-james-bible.txt 
the-king-james-bible.txt: UTF-8 Unicode (with BOM) text

The BOM is the very first character in the file. If not removed, it becomes part of the first word and skews its count.
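A quick way to confirm the BOM is to dump the first three bytes of the file (for a UTF-8 BOM these are EF BB BF); this is just an illustrative check:

$ head -c 3 the-king-james-bible.txt | od -An -tx1
 ef bb bf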

The following is a solution with AWK.

    {
        # strip the UTF-8 BOM from the first record
        if (NR == 1) {
            sub(/^\xef\xbb\xbf/, "")
        }

        # drop punctuation characters
        gsub(/[,;!()*:?.]*/, "")

        for (i = 1; i <= NF; i++) {
            # skip verse numbers (fields starting with a digit)
            if ($i ~ /^[0-9]/) {
                continue
            }
            words[$i]++
        }
    }

    END {
        for (idx in words) {
            print idx, words[idx]
        }
    }

It removes the BOM character and strips punctuation. It does not lowercase the words. In addition, since the program was used to count the words of the Bible, it skips the verse numbers (the if condition with continue skips any field that starts with a digit).

$ awk -f word_freq.awk the-king-james-bible.txt > bible_words.txt

We run the program and write the output into a file.

$ sort -nr -k 2 bible_words.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942

With sort and head, we find the top ten most frequent words in the Bible.

Upvotes: 1

user12578371

Reputation:

Say I have the following text in my file.txt:

This is line number one
This is Line Number Tow
this is Line Number tow

I can find the frequency of each word using the following command:

 cat file.txt | tr ' ' '\n' | sort | uniq -c

Output:

  3 is
  1 line
  2 Line
  1 number
  2 Number
  1 one
  1 this
  2 This
  1 tow
  1 Tow

Upvotes: 3

Dani Konoplya

Reputation: 109

awk 'BEGIN { word[""] = 0 }
{
    for (el = 1; el <= NF; ++el) {
        word[$el]++
    }
}
END {
    for (i in word) {
        if (i != "") {
            print word[i], i
        }
    }
}' file.txt | sort -nr

Upvotes: -1

Jerin A Mathews

Reputation: 8712

You can use tr for this ('\12' is the octal escape for the newline character); just run

tr ' ' '\12' <NAME_OF_FILE | sort | uniq -c | sort -nr > result.txt

Sample Output for a text file of city names:

3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton

Upvotes: 14

GL2014

Reputation: 6694

#!/usr/bin/env bash

declare -A map
words="$1"

[[ -f $1 ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }

while read -r line; do
  for word in $line; do
    ((map[$word]++))
  done
done < "$words"

for key in "${!map[@]}"; do
  echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5
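A possible invocation, assuming the script is saved as wordfreq.sh (the name is only illustrative); sort -nr -k5 orders the lines by their fifth field, which is the count:

$ bash wordfreq.sh file1.txt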

Upvotes: 0

John Red

Reputation: 717

Let's do it in Python 3!

"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""

# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/

import sys

# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])

D = {}
for line in lines:
    for word in line.split():
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        if word in D:
            D[word] += 1
        else:
            D[word] = 1

for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))

Let's name this script "frequency.py" and add a line to "~/.bash_aliases":

alias freq="python3 /path/to/frequency.py"

Now, to find the word frequencies in your file "content.txt", you do:

freq content.txt

You can also pipe output to it:

cat content.txt | freq

And even analyze text from multiple files:

cat content.txt story.txt article.txt | freq

If you are using Python 2, just replace

  • ''.join(list(filter(args...))) with filter(args...)
  • python3 with python
  • print(whatever) with print whatever

Upvotes: 2

Sheharyar

Reputation: 75820

Let's use AWK!

This function lists the frequency of each word occurring in the provided file, in descending order:

function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

Source: AWK-ward Ruby

Upvotes: 7

Bohdan

Reputation: 17213

uniq -c already does what you want, just sort the input:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

output:

  6 a
  7 d
  7 s

Upvotes: 49

Rony

Reputation: 1734

Content of the input file

$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

Alternatively, uniq -ic will count while ignoring case, but then the result list will show This instead of this.
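A sketch of that variant (not part of the original answer; sort -f is needed so that differently-cased copies of a word end up adjacent before uniq -ic groups them):

$ sed 's/\.//g;s/\ /\n/g' inputFile.txt | sort -f | uniq -ic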

Upvotes: 7

Dennis Williamson

Reputation: 360335

The sort requires GNU AWK's asort(). If you have another AWK without asort(), this can easily be adjusted and then piped to sort; a sketch of such an adjustment follows the broken-out version below.

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile

Broken out onto multiple lines:

awk '{
    gsub(/\./, ""); 
    for (i = 1; i <= NF; i++) {
        w = tolower($i); 
        count[w]++; 
        words[w] = w
    }
} 
END {
    qty = asort(words); 
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile
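A possible adjustment for an AWK without asort() (an illustrative sketch, not part of the original answer): drop the in-AWK sorting and pipe the output to sort instead.

awk '{
    gsub(/\./, "")
    for (i = 1; i <= NF; i++)
        count[tolower($i)]++
}
END {
    for (w in count)
        print w "@" count[w]
}' inputfile | sort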

Upvotes: 1

potong

Reputation: 58473

This might work for you (lowercase everything, strip punctuation, split into one word per line, count, then reformat as word@count):

tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' | 
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'

Upvotes: 4
