Reputation: 1722
I have a corpus of the following format:
corpus = ['text_1', 'text_2', ..., 'text_4280']
In total there are 90141 unique words.
For each word, I want to calculate the total number of times it appears in the corpus.
To do so, I created a vectorizer:
vectorizer = CountVectorizer()
Currently, the only way I am aware of doing this is by:
vectorizer.fit_transform(corpus)
However, this will create a sparse SciPy matrix with shape (4280, 90141). Does CountVectorizer have a more memory-efficient approach to get all the column sums of the document-term matrix?
Upvotes: 1
Views: 672
Reputation: 885
You could use
vectorizer.fit_transform(corpus).toarray().sum(axis=0)
EDIT
My bad, you should just remove .toarray() from the above statement. I didn't realise that you could call .sum() on a sparse matrix:
vectorizer.fit_transform(corpus).sum(axis=0)
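For completeness, here is a minimal sketch of how the sparse column sum could be turned into per-word totals. The three-document corpus is a made-up stand-in for the real one, and the names X, totals and word_counts are only illustrative. It assumes the default CountVectorizer settings (lowercasing, tokens of two or more characters). The sparse matrix stores only the non-zero counts, so summing it directly avoids the dense (4280, 90141) array that .toarray() would allocate.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real list of 4280 texts (assumption)
corpus = ["the cat sat", "the dog sat down", "the cat ran"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix, shape (n_docs, n_terms)

# Summing over axis 0 gives one total per column; flatten the 1 x n_terms result to 1-D
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index in X
word_counts = {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}
print(word_counts)
```

Here vocabulary_ is used to look up each word's column, so the totals stay aligned with the words whatever order the columns are in.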
Upvotes: 1