Reputation: 3580
Given the sentence "the quick brown fox jumped over the lazy dog", I would like to get a score of how frequent each word is, based on an NLTK corpus (whichever corpus is most generic/comprehensive).
EDIT:
This question is in relation to this question: python nltk keyword extraction from sentence, where @adi92 suggested using the technique of idf to calculate the 'rareness' of a word. I would like to see what this would look like in practice. The broader problem here is: how do you calculate the rareness of a word's use in the English language? I appreciate that this is a hard problem to solve, but nonetheless NLTK idf (with something like the Brown or Reuters corpus?) might get us part of the way there.
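To make the question concrete, here is a rough sketch of the kind of thing I mean, treating each file in the Brown corpus as one "document" for the idf computation (both the corpus and the document unit are arbitrary choices on my part):

```python
import math
from nltk.corpus import brown

# Each Brown corpus file becomes one "document" for document-frequency purposes.
documents = [set(w.lower() for w in brown.words(fileid))
             for fileid in brown.fileids()]
n_docs = len(documents)

def idf(word):
    # log(N / df): the fewer documents a word appears in, the "rarer" it is.
    df = sum(1 for doc in documents if word.lower() in doc)
    return math.log(n_docs / float(df)) if df else float('inf')

sentence = "the quick brown fox jumped over the lazy dog"
for word in sentence.split():
    print(word, idf(word))
```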
Upvotes: 1
Views: 2002
Reputation: 50190
If you want to know word frequencies, you need a table of word frequencies. Words have different frequencies depending on text genre, so the best frequency table might be based on a domain-specific corpus.
If you're just messing around, it's easy enough to pick a corpus at random and count the words: use <corpus>.words() and NLTK's FreqDist, and/or see the NLTK book for details.
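For instance, a minimal sketch along those lines, using the Brown corpus (an arbitrary choice):

```python
import nltk
from nltk.corpus import brown

# nltk.download('brown')  # first run only, to fetch the corpus

# Count every token in the corpus, lowercased so "The" and "the" merge.
fdist = nltk.FreqDist(w.lower() for w in brown.words())
total = float(fdist.N())  # total number of tokens in the corpus

sentence = "the quick brown fox jumped over the lazy dog"
for word in sentence.split():
    # Relative frequency in the corpus; 0.0 means the word never occurs.
    print(word, fdist[word] / total)
```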
But for serious use, don't bother counting words yourself: if you're not interested in a specific domain, grab a large word frequency table. There are gazillions out there (it's evidently the first thing a corpus creator thinks of), and the largest is probably the "1-gram" tables compiled by Google. You can download them at http://books.google.com/ngrams/datasets
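If you go that route, aggregating per-word totals from one of the downloaded files might look roughly like this (assuming the tab-separated ngram/year/match_count/volume_count layout of the newer releases, and a hypothetical local filename; check the format notes on the datasets page):

```python
from collections import defaultdict

# Sum match_count over all years for each word.
# Assumed line layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
# '1gram-a' is a hypothetical local filename; the real downloads are gzipped
# and split alphabetically.
counts = defaultdict(int)
with open('1gram-a') as f:
    for line in f:
        word, year, match_count, volume_count = line.rstrip('\n').split('\t')
        counts[word.lower()] += int(match_count)

print(counts['apple'])
```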
Upvotes: 1