Get total count of a word in corpus using CountVectorizer

Question

I have a corpus in the following format:

corpus = ['text_1', 'text_2', ..., 'text_4280']

In total there are 90141 unique words. For each word, I want to calculate the total number of times it appears in the corpus.

To do so, I used:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

Currently, the only way I am aware of doing this is:

vectorizer.fit_transform(corpus)

However, this creates a SciPy sparse matrix with shape (4280, 90141). Does CountVectorizer have a more memory-efficient way to get all the column sums of the document-term matrix?

Hammad Ahmed · Accepted Answer

You could use:

vectorizer.fit_transform(corpus).toarray().sum(axis=0)

EDIT

My bad, you should just remove .toarray() from the statement above. I didn't realise that you can call .sum() directly on a sparse matrix, which avoids building the dense (4280, 90141) array:

vectorizer.fit_transform(corpus).sum(axis=0)
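
To get from the column sums back to per-word totals, something like the following should work. It is only a minimal sketch: the three-document corpus is a made-up stand-in for the real one, and the names totals and word_counts are just for illustration. It relies on the fitted vectorizer's vocabulary_ attribute, which maps each word to its column index.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus (hypothetical; the real one has 4280 documents).
corpus = ["the cat sat", "the cat ran", "a dog ran"]

vectorizer = CountVectorizer()
# Sparse document-term matrix, shape (n_documents, n_words).
X = vectorizer.fit_transform(corpus)

# Sum each column without densifying X. The result is a 1 x n_words
# matrix, so flatten it into a plain 1-D array.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index in X.
word_counts = {word: int(totals[idx])
               for word, idx in vectorizer.vocabulary_.items()}

print(word_counts)
```

Note that with the default settings CountVectorizer ignores single-character tokens, so "a" does not get a column here. Only the non-zero entries of X are stored, so the sum itself never needs the full 4280 x 90141 dense array.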
