Reputation: 310
I have to work with pre-tokenized documents, which I can load into a list of strings. I want to use scikit's CountVectorizer to calculate document-term matrices for them. Is this possible?
Or should I manually construct/calculate a document-term matrix myself?
The reason I want to use scikit for this is that the above needs to be integrated into a program that's trained with scikit's CountVectorizer and BinomialNB.
Upvotes: 0
Views: 967
Reputation: 23
In the following code, text_list is the list of document strings, i.e. text_list = [doc1, doc2, ..., docn], with one string per document. Calling fit_transform on it gives you a sparse matrix containing the terms and their frequencies for each document in your corpus.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
TermCountsDoc = count_vect.fit_transform(text_list)    # sparse document-term matrix
Terms = np.array(list(count_vect.vocabulary_.keys()))  # the vocabulary terms
T = TermCountsDoc.todense()  # in case you need to transform it to a dense matrix
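If your documents are already tokenized (each document is a list of tokens rather than a single string), a minimal sketch of one way to handle that: pass an identity callable as the analyzer so CountVectorizer uses your tokens as-is. The tokenized_docs corpus below is a made-up example.

from sklearn.feature_extraction.text import CountVectorizer

# hypothetical pre-tokenized corpus: one list of tokens per document
tokenized_docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]

# a callable analyzer makes CountVectorizer use the tokens as-is,
# skipping its own preprocessing and tokenization
count_vect = CountVectorizer(analyzer=lambda tokens: tokens)
doc_term = count_vect.fit_transform(tokenized_docs)  # sparse document-term matrix

# terms ordered by their column index in the matrix
terms = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(terms)
print(doc_term.toarray())

The resulting sparse matrix has the same shape and format as above, so it can be fed to the downstream classifier in the usual way.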
Upvotes: 1