Katya Willard

Reputation: 2182

SVD on TFIDF Matrix returns an odd shape

The overarching problem: I expected that calling fit_transform on a TruncatedSVD model, given the sparse matrix produced by TfidfVectorizer, would return an array of shape (n_samples, n_components), as stated in the fit_transform section of the scikit-learn TruncatedSVD documentation.

However, I am getting back a matrix of shape (n_components, n_words).

Here is a trivial example to recreate the problem:

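from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf_model(corpus):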
    transformer = TfidfVectorizer(analyzer='word')
    matrix = transformer.fit_transform(corpus)
    return matrix

def svd_tfidf_matrix(matrix):
    svd = TruncatedSVD(n_components=3)
    svd.fit_transform(matrix)
    return svd.components_


corpus = ['sentence one', 'sentence two', 'another one', 'another sentence', 'two sentence', 'one sentence']
tfidf_model = build_tfidf_model(corpus)
reduced_vectors = svd_tfidf_matrix(matrix=tfidf_model)

So, tfidf_model.shape yields (6, 4). This makes sense to me. I have a corpus of six documents, which contain a total of 4 distinct words.

However, reduced_vectors.shape yields (3, 4). I was expecting it to be of shape (6, 3).

I must be misunderstanding what fit_transform is supposed to return. What should I call on the SVD model to get a matrix whose rows are the documents and whose columns are the features in the reduced space?

Upvotes: 1

Views: 1676

Answers (1)

aikramer2

Reputation: 1353

If you want the input represented in the transformed space, fit_transform returns exactly that: an array of shape (n_samples, n_components). Currently you're calling it without assigning the result to a variable, and then returning svd.components_ instead. The components_ attribute has shape (n_components, n_features) and merely describes how to map the TF-IDF vector space to the SVD space.

def svd_tfidf_matrix(matrix):
    svd = TruncatedSVD(n_components=3)
    return svd.fit_transform(matrix)
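
To make the difference concrete, here is a minimal sketch that runs the corpus from the question and prints the three shapes involved. The names tfidf, reduced, and projected are illustrative, and random_state=0 is an assumption added only to make the run repeatable. The last step shows one way to read components_: multiplying the TF-IDF matrix by its transpose gives the same result as svd.transform.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['sentence one', 'sentence two', 'another one',
          'another sentence', 'two sentence', 'one sentence']

# Rows are documents, columns are vocabulary terms.
tfidf = TfidfVectorizer(analyzer='word').fit_transform(corpus)

# random_state is set here only for repeatability (an assumption, not in the question).
svd = TruncatedSVD(n_components=3, random_state=0)

# Keep the return value: this is the corpus in the reduced space.
reduced = svd.fit_transform(tfidf)

print(tfidf.shape)            # (n_samples, n_features)
print(reduced.shape)          # (n_samples, n_components)
print(svd.components_.shape)  # (n_components, n_features)

# components_ maps TF-IDF vectors into the reduced space.
projected = tfidf @ svd.components_.T
print(np.allclose(projected, svd.transform(tfidf)))
```

So the (3, 4) array in the question is the mapping itself, while the (6, 3) array the asker wanted is the return value of fit_transform (or of transform on an already fitted model).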

Upvotes: 1
