summer_ZUGG

Reputation: 31

Why is this CountVectorizer output different from my word counts?

I have a dataframe with a column called 'Phrase'. I used the following code to find the 20 most common words in this column:

print(pd.Series(' '.join(film['Phrase']).lower().split()).value_counts()[:20])

This gave me the following output:

s             16981
film           6689
movie          5905
nt             3970
one            3609
like           3071
story          2520
rrb            2438
lrb            2098
good           2043
characters     1882
much           1862
time           1747
comedy         1721
even           1597
little         1575
funny          1522
way            1511
life           1484
make           1396

I later needed to create vector counts for each word. I tried to do so using the following code:

vectorizer = CountVectorizer()
vectorizer.fit(film['Phrase'])
print(vectorizer.vocabulary_)

I won't show the whole output, but the numbers are different from those above. For example, the value for 'movie' is 9308, for 'good' it is 6131, and for 'make' it is 8655. Why is this happening? Is value_counts just counting every row that contains the word rather than every occurrence of the word? Have I misunderstood what the CountVectorizer object is doing?

Upvotes: 0

Views: 2368

Answers (2)

Venkatachalam

Reputation: 16966

As @MaximeKan mentions, CountVectorizer() does not compute the frequency of each term, but you can compute it from the sparse matrix returned by transform() together with the vectorizer's get_feature_names() method.

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(film['Phrase'])   # sparse document-term matrix

# sum each column of X to get the total count of every term across the corpus
{x: y for x, y in zip(vectorizer.get_feature_names(), X.sum(0).getA1())}

Working example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)

Do not use .toarray() unless it is necessary, because it requires more memory and computation time. We can get the sums directly from the sparse matrix.

>>> list(zip(vectorizer.get_feature_names(), X.sum(0).getA1()))

[('and', 1),
 ('document', 4),
 ('first', 2),
 ('is', 4),
 ('one', 1),
 ('second', 1),
 ('the', 4),
 ('third', 1),
 ('this', 4)]
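
If the goal is to reproduce the top-20 list from the question, the same column sums can be wrapped in a pandas Series and sorted. This is only a sketch, assuming the film dataframe from the question is in scope; the totals will not match the value_counts output exactly, because CountVectorizer's default tokenizer drops single-character tokens such as 's'.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(film['Phrase'])

# build a Series of per-term totals and sort it, mirroring the value_counts output
counts = pd.Series(X.sum(0).getA1(), index=vectorizer.get_feature_names())
print(counts.sort_values(ascending=False)[:20])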

Upvotes: 3

MaximeKan

Reputation: 4221

vectorizer.vocabulary_ does not return word frequencies. According to the documentation, it is:

A mapping of terms to feature indices

What this means is that each of the words in your data gets mapped to an index, which is stored in vectorizer.vocabulary_.

Here is an example to illustrate what is happening:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.DataFrame({"a":["we love music","we love piano"]})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['a'])
print(vectorizer.vocabulary_)

>>> {'we': 3, 'love': 0, 'music': 1, 'piano': 2}

This vectorization identifies 4 words in the data and assigns each of them an index from 0 to 3. Now, you might ask: "But why do I even care about these indices?" Because once the vectorization is done, you need to keep track of the order of the words in your vectorized object. For instance,

X.toarray()
>>> array([[1, 1, 0, 1],
           [1, 0, 1, 1]], dtype=int64)

Using the vocabulary dictionary, you can therefore tell that the first column corresponds to "love", the second to "music", the third to "piano", and the fourth to "we".

Note that this also corresponds to the order of the words in vectorizer.get_feature_names():

vectorizer.get_feature_names()
>>> ['love', 'music', 'piano', 'we']
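
To tie this back to the question: the per-word counts the asker expected come from summing the columns of X, not from vocabulary_. A minimal sketch, continuing with the same toy dataframe:

# column sums of the document-term matrix give the actual word frequencies
dict(zip(vectorizer.get_feature_names(), X.sum(axis=0).getA1()))
>>> {'love': 2, 'music': 1, 'piano': 1, 'we': 2}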

Upvotes: 3
