Lodore66

Reputation: 1185

CountVectorizer giving wrong counts for words?

Let's say my text file consists of the following text:

The quick brown fox jumped over the lazy dogs. A stitch in time saves nine. The quick brown stitch jumped over the lazy time. The fox in time saves a dog.

I want to use scikit-learn's CountVectorizer to get a word count for every word in the file. (I know there are other ways to do this, but I want to use CountVectorizer for a few reasons.) This is my code:

from sklearn.feature_extraction.text import CountVectorizer

path = input('Please enter the filepath for the text: ')
tokens = CountVectorizer(analyzer='word', stop_words='english')

# Iterating over a file object yields one line at a time,
# so each line of the file becomes one document (one row of X).
with open(path, 'r', encoding='utf-8') as f:
    X = tokens.fit_transform(f)
dictionary = tokens.vocabulary_

But when I inspect dictionary, the counts look wrong:

>>> dictionary
{'time': 9, 'dog': 1, 'stitch': 8, 'quick': 6, 'lazy': 5, 'brown': 0, 'saves': 7, 'jumped': 4, 'fox': 3, 'dogs': 2}

Can anyone advise on the (doubtless obvious) mistake I'm making here?

Upvotes: 2

Views: 2711

Answers (1)

Moses Koledoye

Reputation: 78556

vocabulary_ is a dict that maps each term to its column index in the document-term matrix, not to its count. From the docs:

vocabulary_ : A mapping of terms to feature indices.

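X is the document-term matrix that actually holds the counts. Each entry is keyed by (document row, feature index), and its value is the number of times that term occurs in that document.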

>>> for i in X:
...    print(i)
... 
  (0, 1)    1
  (0, 7)    2
  (0, 9)    3
  (0, 8)    2
  (0, 2)    1
  (0, 5)    2
  (0, 4)    2
  (0, 3)    2
  (0, 0)    2
  (0, 6)    2

For example, vocabulary_ maps 'time' to index 9, and the entry (0, 9) shows that 'time' has a count of 3.
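If what you want is a plain word-to-count dict, one way is to sum the matrix over its rows and look each term's index up in vocabulary_. This is only a minimal sketch: it assumes the text from the question is held in a string rather than read from a file, and the names vectorizer, totals and word_counts are illustrative, not from the original post.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# The sample text from the question, inlined here as an assumption
# so the example does not depend on a file path.
text = ("The quick brown fox jumped over the lazy dogs. "
        "A stitch in time saves nine. "
        "The quick brown stitch jumped over the lazy time. "
        "The fox in time saves a dog.")

vectorizer = CountVectorizer(analyzer='word', stop_words='english')

# fit_transform expects an iterable of documents, so wrap the string in a list.
X = vectorizer.fit_transform([text])

# Sum over rows (documents) to get one total per feature index.
# With a single document this is just that document's row.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps term -> column index; use the index to fetch the count.
word_counts = {term: int(totals[idx])
               for term, idx in vectorizer.vocabulary_.items()}

print(word_counts['time'])
```

Summing over axis 0 also covers the case where the file has several lines: passing a file object makes each line its own document, so the totals are the counts across all lines.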

Upvotes: 6
