Reputation: 31
I have a dataframe with a column called 'Phrase'. I used the following code to find the 20 most common words in this column:
print(pd.Series(' '.join(film['Phrase']).lower().split()).value_counts()[:20])
This gave me the following output:
s 16981
film 6689
movie 5905
nt 3970
one 3609
like 3071
story 2520
rrb 2438
lrb 2098
good 2043
characters 1882
much 1862
time 1747
comedy 1721
even 1597
little 1575
funny 1522
way 1511
life 1484
make 1396
I later needed to create vector counts for each word. I tried to do so using the following code:
vectorizer = CountVectorizer()
vectorizer.fit(film['Phrase'])
print(vectorizer.vocabulary_)
I won't show the whole output, but the numbers are different from those above. For example, for the word 'movie' it is 9308, for 'good' it is 6131 and for 'make' it is 8655. Why is this happening? Is the value_counts method just counting every row that contains the word rather than every occurrence of the word? Have I misunderstood what the CountVectorizer object is doing?
Upvotes: 0
Views: 2368
Reputation: 16966
As mentioned by @MaximeKan, CountVectorizer() does not compute the frequency of each term, but we can compute it from the sparse matrix output of transform() together with the get_feature_names() method of the vectorizer.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(film['Phrase'])
# sum each column of the sparse matrix to get the total count of each word
{x: y for x, y in zip(vectorizer.get_feature_names(), X.sum(0).getA1())}
Working example:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
Do not use .toarray() unless it is necessary, because it requires more memory and computation time; we can get the sum directly from the sparse matrix.
>>> list(zip(vectorizer.get_feature_names(), X.sum(0).getA1()))
[('and', 1),
('document', 4),
('first', 2),
('is', 4),
('one', 1),
('second', 1),
('the', 4),
('third', 1),
('this', 4)]
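To get a ranked list like the value_counts() output in the question, a minimal follow-up sketch (assuming pandas is available, as in the question; on scikit-learn 1.0+ the method is named get_feature_names_out()) is to wrap the summed counts in a Series:
>>> import pandas as pd
>>> counts = pd.Series(X.sum(0).getA1(), index=vectorizer.get_feature_names())
>>> counts.nlargest(3)
document    4
is          4
the         4
dtype: int64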
Upvotes: 3
Reputation: 4221
vectorizer.vocabulary_ does not return word frequencies. According to the documentation, it is:
A mapping of terms to feature indices
What this means is that each of the words in your data gets mapped to an index, which is stored in vectorizer.vocabulary_.
Here is an example to illustrate what is happening:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
df = pd.DataFrame({"a":["we love music","we love piano"]})
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['a'])
print(vectorizer.vocabulary_)
>>> {'we': 3, 'love': 0, 'music': 1, 'piano': 2}
This vectorization identifies 4 words in the data, and assigns indices from 0 to 3 to each word. Now, you might ask: "But why do I even care about these indices?" Because once the vectorization is done, you need to keep track of the order of the words in your vectorized object. For instance,
X.toarray()
>>> array([[1, 1, 0, 1],
[1, 0, 1, 1]], dtype=int64)
Using the vocabulary dictionary, you can therefore tell that the first column corresponds to "love", the second to "music", the third to "piano" and the fourth to "we".
Note that this also corresponds to the order of the words in vectorizer.get_feature_names():
vectorizer.get_feature_names()
>>> ['love', 'music', 'piano', 'we']
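If you do want the total count of a particular word (which is what value_counts() reports), a minimal sketch building on the example above is to use the index from vocabulary_ to select and sum the corresponding column of X:
idx = vectorizer.vocabulary_['love']  # column index assigned to "love"
print(X[:, idx].sum())                # 2: "love" appears once in each of the two phrases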
Upvotes: 3