user3481246

Reputation: 61

nltk function to count occurrences of certain words

In the nltk book there is the question "Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?"

I thought I could use something like state_union.words('1945-Truman.txt').count('men'). However, there are over 60 texts in the State of the Union corpus, and I feel there has to be an easier way to get the counts of these words for each one than repeating this call for every text.

Upvotes: 6

Views: 7121

Answers (1)

alvas

Reputation: 122112

You can use the corpus reader's .words() method to return a list of strings (i.e. tokens/words):

>>> from nltk.corpus import brown
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]

Then use the Counter() object to count the instances, see https://docs.python.org/2/library/collections.html#collections.Counter:

>>> from collections import Counter
>>> wordcounts = Counter(brown.words())

Note that Counter is case-sensitive, so 'The' and 'the' are counted separately unless you lowercase the tokens first:

>>> from nltk.corpus import brown
>>> from collections import Counter
>>> brown.words()
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> wordcounts = Counter(brown.words())
>>> wordcounts['the']
62713
>>> wordcounts['The']
7258
>>> wordcounts_lower = Counter(i.lower() for i in brown.words())
>>> wordcounts_lower['The']
0
>>> wordcounts_lower['the']
69971
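To get counts for each document rather than for the whole corpus, the same lowercased Counter can be built once per file. The following is only a minimal sketch: the helper names count_targets and count_per_document are made up for illustration, and the toy token lists stand in for real corpus text so the snippet runs without NLTK data. With NLTK installed and the corpus downloaded, the documents mapping could presumably be built from state_union.fileids() and state_union.words(fileid), as shown in the trailing comment.

```python
from collections import Counter

# Words the exercise asks about; compared after lowercasing.
TARGETS = ("men", "women", "people")


def count_targets(tokens, targets=TARGETS):
    """Return {target: count} for one document, ignoring case."""
    counts = Counter(token.lower() for token in tokens)
    # Counter returns 0 for missing keys, so absent words are reported as 0.
    return {target: counts[target] for target in targets}


def count_per_document(documents, targets=TARGETS):
    """Map each file id to its target counts.

    `documents` is a dict of file id -> iterable of tokens. File ids are
    visited in sorted order, which is chronological when the names start
    with the year (as in '1945-Truman.txt').
    """
    return {
        fileid: count_targets(tokens, targets)
        for fileid, tokens in sorted(documents.items())
    }


# Toy stand-in data (hypothetical, not real corpus text).
toy_documents = {
    "1946-Example.txt": ["The", "people", "and", "the", "People"],
    "1945-Example.txt": ["Men", "and", "women", "of", "the", "people", ",", "men"],
}

for fileid, counts in count_per_document(toy_documents).items():
    print(fileid, counts)

# Assumed usage with the real corpus (requires the NLTK data to be downloaded):
#   from nltk.corpus import state_union
#   documents = {f: state_union.words(f) for f in state_union.fileids()}
#   per_file = count_per_document(documents)
```

Counting every token once per file and then looking up the three words avoids calling .count() three times on each word list, which would rescan the whole document for every target word.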

Upvotes: 4
