frooty

Reputation: 69

How to find the frequency of every word in a text file in Python

I want to find the frequency of all words in my text file so that I can find the most frequently occurring words. Can someone please tell me the command to use for that?

import nltk
text1 = "hello he heloo hello hi "  # example text
fdist1 = FreqDist(text1)

I have used the code above, but the problem is that it is not giving word frequency; instead it is displaying the frequency of every character. I also want to know how to input the text from a text file.

Upvotes: 2

Views: 10658

Answers (5)

ramya yogesh

Reputation: 1

I think the code below will get you the frequency of each word in the file, in dictionary form:

myfile = open('greet.txt')
temp = myfile.read()
myfile.close()
x = temp.split("\n")
y = list()
for item in x:
    z = item.split(" ")
    y.append(z)
count = dict()
for name in y:
    for items in name:
        if items not in count:
            count[items] = 1
        else:
            count[items] = count[items] + 1
print(count)

Upvotes: 0

Boa

Reputation: 2677

For what it's worth, NLTK seems like overkill for this task. The following will give you word frequencies, in order from highest to lowest.

from collections import Counter
input_string = [...] # get the input from a file
word_freqs = Counter(input_string.split())
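To fill in the file-reading step, a minimal sketch (the filename `words.txt` here is hypothetical; substitute your own):

```python
from collections import Counter

# Create a small sample file just for this demo (hypothetical name).
with open('words.txt', 'w') as f:
    f.write("hello he heloo hello hi\n")

# Read the whole file and count whitespace-separated words.
with open('words.txt') as f:
    input_string = f.read()

word_freqs = Counter(input_string.split())
print(word_freqs.most_common(2))  # -> [('hello', 2), ('he', 1)]
```

`most_common(n)` gives you the n highest-frequency words first, which is exactly the "most frequently occurring words" the question asks for.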

Upvotes: 4

Dibin Joseph

Reputation: 261

In order to have the words and their frequencies as a dictionary, the following code will be beneficial:

import nltk
from nltk.tokenize import word_tokenize

inputSentence = "hello he heloo hello hi"
freq = {}
for f in word_tokenize(inputSentence):
    freq[f] = freq.get(f, 0) + 1

print(freq)

Upvotes: 1

jfs

Reputation: 414795

text1 in the nltk book is a collection of tokens (words, punctuation) unlike in your code example where text1 is a string (collection of Unicode codepoints):

>>> from nltk.book import text1
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text1[99] # 100th token in the text
','
>>> from nltk import FreqDist
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024,
          'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

If your input is indeed space-separated words then to find the frequency, use @Boa's answer:

from collections import Counter

freq = Counter(text_with_space_separated_words.split())

Note: FreqDist is a Counter but it also defines additional methods such as .plot().
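Because FreqDist subclasses Counter, the familiar Counter API (most_common, zero-default lookup, update) carries over unchanged; a stdlib-only sketch of that shared interface:

```python
from collections import Counter

# FreqDist inherits these methods from Counter, so they behave the same there.
c = Counter("hello he heloo hello hi".split())
print(c.most_common(1))   # -> [('hello', 2)]
print(c['hello'])         # -> 2
print(c['missing'])       # -> 0 (missing keys count as zero, no KeyError)

c.update(['hi', 'hi'])    # add more observations to the existing counts
print(c['hi'])            # -> 3
```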

If you want to use nltk tokenizers instead:

#!/usr/bin/env python3
from itertools import chain
from nltk import FreqDist, sent_tokenize, word_tokenize # $ pip install nltk

with open('your_text.txt') as file:
    text = file.read()
words = chain.from_iterable(map(word_tokenize, sent_tokenize(text)))
freq = FreqDist(map(str.casefold, words))
freq.pprint()
# -> FreqDist({'hello': 2, 'hi': 1, 'heloo': 1, 'he': 1})

sent_tokenize() tokenizes the text into sentences. Then word_tokenize tokenizes each sentence into words. There are many ways to tokenize text in nltk.
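If pulling in nltk is not an option, a rough stdlib approximation of word tokenization is a regex; this is only a sketch, not a replacement for nltk's trained tokenizers:

```python
import re
from collections import Counter

text = "Hello, he said. Heloo! Hello hi."
# \w+ grabs runs of word characters and drops punctuation entirely;
# nltk's word_tokenize would instead keep punctuation as separate tokens.
words = re.findall(r"\w+", text.casefold())
freq = Counter(words)
print(freq['hello'])  # -> 2
```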

Upvotes: 2

heinst

Reputation: 8786

I saw you were using the example and observed the same thing you did: for it to work properly, you have to split the string on spaces. If you do not, FreqDist counts each character, which is what you were seeing. The following returns the proper count of each word, not each character.

import nltk

text1 = 'hello he heloo hello hi '
text1 = text1.split(' ')
fdist1 = nltk.FreqDist(text1)
print(fdist1.most_common(50))

If you want to read from a file and get the word count, you can do it like so:

input.txt

hello he heloo hello hi
my username is heinst
your username is frooty

python code

import nltk

with open("input.txt", "r") as myfile:
    data = myfile.read().replace('\n', ' ')

data = data.split(' ')
fdist1 = nltk.FreqDist(data)
print(fdist1.most_common(50))
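One caveat with splitting on a literal space: trailing spaces or newlines replaced by spaces produce empty strings, which then get counted as a "word". Calling split() with no argument splits on any run of whitespace (including newlines, so the replace('\n', ' ') step becomes unnecessary) and avoids that. A small stdlib sketch of the difference:

```python
text = "hello he heloo hello hi "   # note the trailing space

# Splitting on a literal space keeps an empty string at the end.
print(text.split(' '))  # -> ['hello', 'he', 'heloo', 'hello', 'hi', '']

# Splitting on any whitespace run drops it.
print(text.split())     # -> ['hello', 'he', 'heloo', 'hello', 'hi']
```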

Upvotes: 5
