Reputation: 3154
I have a set of documents (stored as .txt files). I also have a Python dictionary of some selected words. I want to assign tf-idf scores only to these words, and not to all words, from the set of documents. How can this be done using scikit-learn or any other library?
I have referred to this blog post, but it gives scores for the full vocabulary.
Upvotes: 3
Views: 1196
Reputation: 9136
You can do it with CountVectorizer, which scans the documents as text and converts them into a term-document matrix, and then by applying TfidfTransformer to that matrix.
These two steps can also be combined and done together with the TfidfVectorizer.
These are in the sklearn.feature_extraction.text module [link].
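To restrict the scores to the selected words, both CountVectorizer and TfidfVectorizer accept a vocabulary argument. Here is a minimal sketch; the sample documents and the contents of selected_words are made up for illustration, and I am assuming the dictionary keys are the words of interest, so only the keys are passed on. In practice you would read the .txt files into the docs list first.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Hypothetical dictionary of selected words; only the keys are used here.
selected_words = {"cat": None, "dog": None, "mat": None}
vocab = sorted(selected_words)  # list of keys, fixes the column order

# Two-step route: raw counts first, then tf-idf weighting.
counts = CountVectorizer(vocabulary=vocab).fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
vectorizer = TfidfVectorizer(vocabulary=vocab)
tfidf_one_step = vectorizer.fit_transform(docs)

# Rows are documents, columns are the selected words only.
print(vectorizer.vocabulary_)
print(tfidf_one_step.toarray())
```

Note that the default tokenizer matches whole tokens, so "cats" and "dogs" in the third document do not count towards "cat" and "dog", and that row comes out as all zeros.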
Both approaches return the same sparse matrix representation, on which I presume you will then apply an SVD transform with TruncatedSVD to get a smaller dense matrix.
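If you do go the SVD route, it could look roughly like this. This is only a sketch: the documents are invented, and n_components=2 is an arbitrary choice for the example (it has to be smaller than the number of columns in the tf-idf matrix).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
    "the mat was under the dog",
]

# Sparse tf-idf matrix, one row per document.
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works directly on sparse input and returns a dense array.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape)    # (number of documents, vocabulary size)
print(reduced.shape)  # (number of documents, 2)
```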
You can of course also do it yourself, which requires keeping two maps in which you count the terms: one for each document, and one overall. That is how these classes operate under the hood.
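A rough sketch of the do-it-yourself version with the two kinds of maps is below. Some assumptions on my part: tokenization is a plain lowercase whitespace split (simpler than what scikit-learn does), the idf formula is the smoothed variant ln((1 + n) / (1 + df)) + 1, and no length normalization is applied, so the numbers will not match TfidfVectorizer's defaults exactly. The documents and selected_words are made up.

```python
import math
from collections import Counter

# Hypothetical stand-ins for the contents of the .txt files.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
selected_words = ["cat", "dog", "mat"]  # hypothetical selection

# One map per document: term -> number of occurrences in that document.
per_doc_counts = [Counter(doc.lower().split()) for doc in docs]

# One overall map: term -> number of documents containing the term.
doc_freq = Counter()
for counts in per_doc_counts:
    doc_freq.update(counts.keys())

n_docs = len(docs)


def idf(term):
    # Smoothed inverse document frequency (an assumed variant).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


# tf-idf for the selected words only, one dict per document.
scores = [
    {term: counts[term] * idf(term) for term in selected_words}
    for counts in per_doc_counts
]

for doc_scores in scores:
    print(doc_scores)
```

Since Counter returns 0 for missing keys, a selected word that never occurs in a document simply gets a score of 0 there.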
This page has some nice examples.
Upvotes: 1