RedShift

Reputation: 273

Understanding the matrix output of TfidfVectorizer in sklearn

I'm having trouble interpreting the matrix output for the Tfidf vectorizer.

Given

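from sklearn.feature_extraction.text import TfidfVectorizer

# X_train_raw is an iterable of raw text documents (strings)
vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                             min_df=2, stop_words='english',
                             use_idf=True)

X_train_tfidf = vectorizer.fit_transform(X_train_raw)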

If I were to look at the output of X_train_tfidf, am I looking at a matrix that is structured like:

Column 1 corresponds to document 1 where its elements are tfidf scores of the 10000 features, Column 2 corresponds to document 2... and so on?

Upvotes: 7

Views: 10664

Answers (1)

BassFaceIV

Reputation: 142

Assuming you're seeing output similar to this:

(0, 18)       0.424688479366
(0, 6)        0.424688479366
(0, 4)        0.424688479366
(0, 14)       0.239262081323
(0, 17)       0.202366335916
(0, 5)        0.424688479366
(0, 1)        0.424688479366
(1, 17)       0.184426607226
(1, 8)        0.387039944282
(1, 15)       0.387039944282
(1, 0)        0.387039944282
(1, 2)        0.387039944282
(1, 13)       0.387039944282
(1, 7)        0.387039944282
(1, 11)       0.259205161463
(2, 14)       0.313686744222
(2, 17)       0.530628478217
(2, 9)        0.556791722552
(2, 16)       0.556791722552
(3, 14)       0.346483013718
(3, 17)       0.293053113789
(3, 11)       0.411875926253
(3, 10)       0.61500486583
(3, 3)        0.496182053366
(4, 14)       0.346483013718
(4, 17)       0.293053113789
(4, 11)       0.411875926253
(4, 3)        0.496182053366
(4, 12)       0.61500486583

Assume the general form: (A, B) C

A: Document index
B: Specific word-vector index
C: TFIDF score for word B in document A

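This is the printed form of a sparse matrix: only the non-zero tfidf scores are listed, one per (document, word) pair. Rows correspond to documents and columns to features, so X_train_tfidf has shape (number of documents, number of features). In other words, document 1 is row 0, not column 1, and with max_features=10000 the matrix has at most 10000 columns.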
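To check the layout yourself, here is a minimal sketch. The three-document corpus `docs` is made up for illustration (it stands in for X_train_raw), and the vectorizer uses default settings rather than the ones in the question, so the indices and scores will not match the output above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, standing in for X_train_raw
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rows are documents, columns are vocabulary terms
n_docs, n_features = X.shape
print("documents:", n_docs, "features:", n_features)

# vocabulary_ maps term -> column index; invert it to read column indices
index_to_term = {col: term for term, col in vectorizer.vocabulary_.items()}

# Walk the non-zero entries as (A, B) C triples, with B resolved to its term
coo = X.tocoo()
entries = [
    (int(row), index_to_term[int(col)], float(score))
    for row, col, score in zip(coo.row, coo.col, coo.data)
]
for row, term, score in entries:
    print(f"doc {row}  {term:<6} {score:.4f}")

# Dense view of a single document: one row with n_features columns
first_doc_dense = X[0].toarray()
```

Each printed line is one non-zero entry: the document (row) index, the word that the column index stands for, and its tfidf score. Calling toarray() on the whole matrix gives the same data in dense form, which is fine for a toy corpus but can be very memory-hungry with thousands of documents and features.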

Upvotes: 13
