Reputation: 310
I have to work with pre-tokenized documents, which I can load into a list of strings. I want to use scikit's CountVectorizer to calculate document-term matrices for them. Is this possible?
Or should I manually construct/calculate a document-term matrix myself?
The reason I want to use scikit for this is that the above needs to be integrated into a program that's trained with scikit's CountVectorizer and BinomialNB.
Upvotes: 0
Views: 967
Reputation: 23
In the following code, text_list is the list of document strings, i.e. text_list = [doc1, doc2, ..., docn], with one string per document. Calling fit_transform on it gives you a sparse matrix containing the terms and their frequencies for each document in your corpus.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
TermCountsDoc = count_vect.fit_transform(text_list)    # sparse document-term matrix
Terms = np.array(list(count_vect.vocabulary_.keys()))  # the vocabulary terms
T = TermCountsDoc.todense()  # in case you need to transform it to a dense matrix
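If your documents are already tokenized (each document is a list of tokens rather than a single string), a minimal sketch of one way to handle that: pass an identity callable as the analyzer so CountVectorizer uses your tokens as-is. The tokenized_docs corpus below is a made-up example.

from sklearn.feature_extraction.text import CountVectorizer

# hypothetical pre-tokenized corpus: one list of tokens per document
tokenized_docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]

# a callable analyzer makes CountVectorizer use the tokens as-is,
# skipping its own preprocessing and tokenization
count_vect = CountVectorizer(analyzer=lambda tokens: tokens)
doc_term = count_vect.fit_transform(tokenized_docs)  # sparse document-term matrix

# terms ordered by their column index in the matrix
terms = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(terms)
print(doc_term.toarray())

The resulting sparse matrix has the same shape and format as above, so it can be fed to the downstream classifier in the usual way.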
Upvotes: 1