Reputation: 371
The Python package nltk has the FreqDist function, which gives you the frequency of words within a text. I am trying to pass my text as an argument, but the result is of the form:
[' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', '\n', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x']
whereas in the example on the nltk website, the result was whole words, not characters. Here is how I am currently using the function:
file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = fdist.keys()
vocab[:100]
What am I doing wrong?
Upvotes: 35
Views: 114884
Reputation: 21
import nltk

# `text` is assumed to already be a sequence of word tokens (e.g. a token list or nltk.Text)
text_dist = nltk.FreqDist(word for word in text if word.isalpha())
top1_text1 = text_dist.max()  # the single most frequent word
maxfreq = top1_text1
Upvotes: 0
Reputation: 27
Your_string = "here is my string"
tokens = Your_string.split()
Do it this way, and then use the NLTK functions: it will give you tokens as words, not characters, as in the sketch below.
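For example, a minimal sketch along those lines (the sample string is purely illustrative):

from nltk.probability import FreqDist

Your_string = "here is my string and here is more of my string"
tokens = Your_string.split()          # plain whitespace tokenization, no NLTK needed

fdist = FreqDist(tokens)
print(fdist.most_common(3))           # e.g. [('here', 2), ('is', 2), ('my', 2)]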
Upvotes: 1
Reputation: 3871
You simply have to use it like this:
import nltk
from nltk.probability import FreqDist
sentence='''This is my sentence'''
tokens = nltk.tokenize.word_tokenize(sentence)
fdist=FreqDist(tokens)
The variable fdist is of type nltk.probability.FreqDist and contains the frequency distribution of the words.
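To inspect it, you can use the usual Counter-style methods (in NLTK 3, FreqDist subclasses collections.Counter), for example:

print(fdist.most_common(3))   # e.g. [('This', 1), ('is', 1), ('my', 1)]
print(fdist['sentence'])      # frequency of one word -> 1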
Upvotes: 9
Reputation: 18385
NLTK's FreqDist accepts any iterable. Since a string is iterated character by character, it is pulling things apart in the way that you're experiencing.
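You can see that directly (a small illustration):
>>> from nltk.probability import FreqDist
>>> FreqDist("hello").most_common()
[('l', 2), ('h', 1), ('e', 1), ('o', 1)]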
In order to count words, you need to feed FreqDist words. How do you do that? Well, you might think (as others have suggested in answers to your question) to feed the whole file to nltk.tokenize.word_tokenize.
>>> # first, let's import the dependencies
>>> import nltk
>>> from nltk.probability import FreqDist
>>> # wrong :(
>>> words = nltk.tokenize.word_tokenize(p)
>>> fdist = FreqDist(words)
word_tokenize builds word models from sentences. It needs to be fed each sentence one at a time. It will do a relatively poor job when given whole paragraphs or even documents.
So, what to do? Easy, add in a sentence tokenizer!
>>> fdist = FreqDist()
>>> for sentence in nltk.tokenize.sent_tokenize(p):
...     for word in nltk.tokenize.word_tokenize(sentence):
...         fdist[word] += 1
One thing to bear in mind is that there are many ways to tokenize text. The functions nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize simply pick a reasonable default for relatively clean English text. There are several other options to choose from, which you can read about in the API documentation.
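For instance, two of the alternatives (shown purely as an illustration):
>>> from nltk.tokenize import wordpunct_tokenize, RegexpTokenizer
>>> wordpunct_tokenize("Don't count punctuation, please.")
['Don', "'", 't', 'count', 'punctuation', ',', 'please', '.']
>>> RegexpTokenizer(r'\w+').tokenize("Don't count punctuation, please.")
['Don', 't', 'count', 'punctuation', 'please']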
Upvotes: 23
Reputation: 8986
FreqDist runs on an array of tokens. You're sending it an array of characters (a string); you should have tokenized the input first:
words = nltk.tokenize.word_tokenize(p)
fdist = FreqDist(words)
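Put together with the file-reading code from the question, a minimal sketch could look like this (the file path is a placeholder, and word_tokenize may need the 'punkt' tokenizer data, via nltk.download('punkt')):

import nltk
from nltk.probability import FreqDist

fileurl = "your_text_file.txt"          # placeholder path

with open(fileurl) as file_y:
    p = file_y.read()

words = nltk.tokenize.word_tokenize(p)
fdist = FreqDist(words)

vocab = list(fdist.keys())              # keys() is a view in Python 3, so convert before slicing
print(vocab[:100])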
Upvotes: 33
Reputation: 11744
FreqDist expects an iterable of tokens. A string is iterable: the iterator yields every character.
Pass your text to a tokenizer first, and pass the tokens to FreqDist.
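A tiny sketch of the difference (the sample sentence is purely illustrative):

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

text = "the quick brown fox jumps over the lazy dog the end"

print(FreqDist(text).most_common(2))                  # character counts, e.g. [(' ', 10), ('e', 5)]
print(FreqDist(word_tokenize(text)).most_common(2))   # word counts, e.g. [('the', 3), ('quick', 1)]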
Upvotes: 52