Reputation: 2189
I have the following code:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
Now I'm trying to calculate the word frequencies like this:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
Next I would like to print the vocabulary, so I do:
vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
Right now I get the output None, while I expect something like:
{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
Any thoughts on where this goes wrong?
Upvotes: 3
Views: 10859
Reputation: 1117
CountVectorizer doesn't return a word-to-count dictionary directly. If you just want word frequencies, you can use the Counter class:
from collections import Counter
train_set = ("The sky is blue.", "The sun is bright.")
word_counter = Counter()
for s in train_set:
word_counter.update(s.split())
print(word_counter)
This gives:
Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1})
Or you can use FreqDist from nltk:
from nltk import FreqDist
train_set = ("The sky is blue.", "The sun is bright.")
word_dist = FreqDist()
for s in train_set:
word_dist.update(s.split())
print(dict(word_dist))
This gives:
{'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
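Note that splitting on whitespace keeps punctuation and case, which is why 'blue.' and 'The' show up as tokens above. A minimal sketch of one way to avoid that, assuming you want lowercased tokens of two or more word characters (the regex and the name TOKEN_RE are my own choices for illustration, picked to resemble CountVectorizer's default tokenization):

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")

# Assumed tokenizer: runs of two or more word characters, punctuation dropped
TOKEN_RE = re.compile(r"\b\w\w+\b")

word_counter = Counter()
for s in train_set:
    # Lowercase first so "The" and "the" are counted together
    word_counter.update(TOKEN_RE.findall(s.lower()))

print(word_counter)
```

With this, 'blue.' is counted as 'blue' and 'The' as 'the'. Adjust the pattern if you need single-character words.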
Upvotes: 4
Reputation: 1136
I think you can try this. After fitting, the learned vocabulary is stored in vocabulary_ (with a trailing underscore):
print(vectorizer.vocabulary_)
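For context, here is a rough end-to-end sketch, assuming a reasonably recent scikit-learn and default CountVectorizer settings. vocabulary_ maps each term to its column index in the matrix, not to a count. If you want counts, one option is to sum the columns of the matrix returned by fit_transform. The variable names vocab, totals and freqs are just for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix
counts = vectorizer.fit_transform(train_set)

# Fitted attribute: term -> column index (not a frequency)
vocab = vectorizer.vocabulary_
print(vocab)

# Sum each column to get the total count of each term across all documents
totals = np.asarray(counts.sum(axis=0)).ravel()
freqs = {term: int(totals[idx]) for term, idx in vocab.items()}
print(freqs)
```

With the defaults, words such as "the" and "is" stay in the vocabulary. Passing stop_words='english' to CountVectorizer should filter common English words if that is what you are after.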
Upvotes: 5