Reputation: 2182
The overarching problem:
I thought that running fit_transform on a TruncatedSVD model, with the sparse vectors from TfidfVectorizer as input, would yield components with dimension (n_samples, n_components), as noted here (jump down to the fit_transform section). However, I am getting back a matrix of shape (n_components, n_words).
Here is a trivial example to recreate the problem:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf_model(corpus):
    transformer = TfidfVectorizer(analyzer='word')
    matrix = transformer.fit_transform(corpus)
    return matrix

def svd_tfidf_matrix(matrix):
    svd = TruncatedSVD(n_components=3)
    svd.fit_transform(matrix)
    return svd.components_

corpus = ['sentence one', 'sentence two', 'another one', 'another sentence', 'two sentence', 'one sentence']
tfidf_model = build_tfidf_model(corpus)
reduced_vectors = svd_tfidf_matrix(matrix=tfidf_model)
So, tfidf_model.shape yields (6, 4). This makes sense to me: I have a corpus of six documents, which contain a total of four distinct words.
However, reduced_vectors.shape yields (3, 4). I was expecting it to be of shape (6, 3).
I must be misunderstanding what calling fit_transform is supposed to return. What can I call on the SVD model to get a matrix where the rows are the documents and the columns are the features in the reduced space?
Upvotes: 1
Views: 1676
Reputation: 1353
If you want the input represented in the transformed space, fit_transform returns exactly that. Currently you're calling it without assigning the result to a variable, so the result is discarded. The components_ attribute of the model merely describes how to map the tf-idf vector space to the SVD space; its shape is (n_components, n_features), which is why you are seeing (3, 4).
def svd_tfidf_matrix(matrix):
    svd = TruncatedSVD(n_components=3)
    return svd.fit_transform(matrix)
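To make the difference between the two objects concrete, here is a minimal sketch that reuses the corpus from the question and prints the shapes side by side. The variable names (tfidf, reduced, projected) and the random_state=0 argument are my own additions for illustration and reproducibility; they are not part of the original code.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['sentence one', 'sentence two', 'another one',
          'another sentence', 'two sentence', 'one sentence']

# Sparse tf-idf matrix: one row per document, one column per distinct word.
tfidf = TfidfVectorizer(analyzer='word').fit_transform(corpus)

# random_state is fixed only to make the run reproducible (an assumption,
# not something the question used).
svd = TruncatedSVD(n_components=3, random_state=0)

# fit_transform returns the documents expressed in the reduced space.
reduced = svd.fit_transform(tfidf)

print(tfidf.shape)            # (6, 4)
print(reduced.shape)          # (6, 3)
print(svd.components_.shape)  # (3, 4)

# components_ holds the projection directions: multiplying the tf-idf
# matrix by its transpose should match what transform() returns.
projected = np.asarray(tfidf @ svd.components_.T)
print(np.allclose(projected, svd.transform(tfidf)))
```

So the rows-as-documents matrix you are after is the return value of fit_transform (or transform on an already fitted model), while components_ is what you would keep around if you wanted to project new tf-idf vectors by hand.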
Upvotes: 1