afg102

Reputation: 371

FreqDist with NLTK

The Python package nltk has the FreqDist class, which gives you the frequency of words within a text. I am trying to pass my text as an argument, but the result is of the form:

[' ', 'e', 'a', 'o', 'n', 'i', 't', 'r', 's', 'l', 'd', 'h', 'c', 'y', 'b', 'u', 'g', '\n', 'm', 'p', 'w', 'f', ',', 'v', '.', "'", 'k', 'B', '"', 'M', 'H', '9', 'C', '-', 'N', 'S', '1', 'A', 'G', 'P', 'T', 'W', '[', ']', '(', ')', '0', '7', 'E', 'J', 'O', 'R', 'j', 'x']

whereas in the example on the nltk website, the result was whole words not characters. Here is how I am currently using the function:

file_y = open(fileurl)
p = file_y.read()
fdist = FreqDist(p)
vocab = fdist.keys()
vocab[:100]

What am I doing wrong?

Upvotes: 35

Views: 114884

Answers (6)

Pranab Bijoypuri

Reputation: 21

import nltk

# `text` must already be a sequence of tokens (e.g. an nltk.Text), not a raw string
text_dist = nltk.FreqDist(word for word in text if word.isalpha())
top1_text1 = text_dist.max()  # the single most frequent word
maxfreq = top1_text1

Upvotes: 0

Musadiq

Reputation: 27

Your_string = "here is my string"
tokens = Your_string.split()  # whitespace split yields word tokens

Do it this way first, and then use the NLTK functions on the tokens; this will give you word tokens rather than characters.
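
For instance, a minimal sketch of the full pipeline (the example string and the most_common call are illustrative additions, not part of the original answer):

from nltk.probability import FreqDist

Your_string = "here is my string and here it ends"
tokens = Your_string.split()   # whitespace tokenization
fdist = FreqDist(tokens)       # counts whole words, not characters
print(fdist.most_common(3))    # e.g. [('here', 2), ('is', 1), ('my', 1)]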

Upvotes: 1

Aakash Anuj

Reputation: 3871

You simply have to use it like this:

import nltk
from nltk.probability import FreqDist

sentence = '''This is my sentence'''
tokens = nltk.tokenize.word_tokenize(sentence)
fdist = FreqDist(tokens)

The variable fdist is of type nltk.probability.FreqDist and contains the frequency distribution of words.
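
For instance, the result behaves like a dictionary of counts (a short illustrative continuation of the snippet above; exact tie order may vary):

print(fdist['sentence'])      # count for a single token: 1
print(fdist.most_common(2))   # e.g. [('This', 1), ('is', 1)]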

Upvotes: 9

Tim McNamara

Reputation: 18385

NLTK's FreqDist accepts any iterable. Since a string is iterated character by character, FreqDist is pulling your text apart into individual characters, which is what you're seeing.

In order to count words, you need to feed FreqDist words. How do you do that? Well, you might think (as others have suggested in answers to your question) to feed the whole file to nltk.tokenize.word_tokenize.

>>> # first, let's import the dependencies
>>> import nltk
>>> from nltk.probability import FreqDist

>>> # wrong :(
>>> words = nltk.tokenize.word_tokenize(p)
>>> fdist = FreqDist(words)

word_tokenize is designed to work on individual sentences. It needs to be fed one sentence at a time, and it will do a relatively poor job when given whole paragraphs or even documents.

So, what to do? Easy, add in a sentence tokenizer!

>>> fdist = FreqDist()
>>> for sentence in nltk.tokenize.sent_tokenize(p):
...     for word in nltk.tokenize.word_tokenize(sentence):
...         fdist[word] += 1

One thing to bear in mind is that there are many ways to tokenize text. The functions nltk.tokenize.sent_tokenize and nltk.tokenize.word_tokenize simply pick a reasonable default for relatively clean English text. There are several other options to choose from, which you can read about in the API documentation.
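
For instance, a quick illustrative comparison of two of those options (this example sentence is my own, and exact output can vary by NLTK version): wordpunct_tokenize splits on punctuation more aggressively than word_tokenize.

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> word_tokenize("Don't split e-mail.")
['Do', "n't", 'split', 'e-mail', '.']
>>> wordpunct_tokenize("Don't split e-mail.")
['Don', "'", 't', 'split', 'e', '-', 'mail', '.']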

Upvotes: 23

Eran Kampf

Reputation: 8986

FreqDist runs on an array of tokens. You're sending it an array of characters (a string), when you should have tokenized the input first:

words = nltk.tokenize.word_tokenize(p)
fdist = FreqDist(words)

Upvotes: 33

Alex Brasetvik

Reputation: 11744

FreqDist expects an iterable of tokens. A string is iterable: the iterator yields every character.

Pass your text to a tokenizer first, and pass the tokens to FreqDist.
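
A minimal sketch of that pipeline, assuming your text is in p as in the question:

import nltk
from nltk.probability import FreqDist

# nltk.download('punkt')  # tokenizer models, needed once

tokens = nltk.tokenize.word_tokenize(p)  # words, not characters
fdist = FreqDist(tokens)
print(fdist.most_common(100))            # top 100 (word, count) pairs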

Upvotes: 52
