Understanding the matrix output of Tfidfvectorizer in Sklearn

Question

I'm having trouble interpreting the matrix output for the Tfidf vectorizer.

Given

vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000,
                         min_df=2, stop_words='english',
                         use_idf=True)


X_train_tfidf = vectorizer.fit_transform(X_train_raw)

If I were to look at the output of X_train_tfidf, am I looking at a matrix that is structured like:

Column 1 corresponds to document 1 where its elements are tfidf scores of the 10000 features, Column 2 corresponds to document 2... and so on?

BassFaceIV · Accepted Answer

Assuming you're seeing output similar to this:

(0, 18)       0.424688479366
(0, 6)        0.424688479366
(0, 4)        0.424688479366
(0, 14)       0.239262081323
(0, 17)       0.202366335916
(0, 5)        0.424688479366
(0, 1)        0.424688479366
(1, 17)       0.184426607226
(1, 8)        0.387039944282
(1, 15)       0.387039944282
(1, 0)        0.387039944282
(1, 2)        0.387039944282
(1, 13)       0.387039944282
(1, 7)        0.387039944282
(1, 11)       0.259205161463
(2, 14)       0.313686744222
(2, 17)       0.530628478217
(2, 9)        0.556791722552
(2, 16)       0.556791722552
(3, 14)       0.346483013718
(3, 17)       0.293053113789
(3, 11)       0.411875926253
(3, 10)       0.61500486583
(3, 3)        0.496182053366
(4, 14)       0.346483013718
(4, 17)       0.293053113789
(4, 11)       0.411875926253
(4, 3)        0.496182053366
(4, 12)       0.61500486583

Assume general form: (A,B) C

A: Document index B: Specific word-vector index C: TFIDF score for word B in document A

This is a sparse matrix. It indicates the tfidf score for all non-zero values in the word vector for each document.

Understanding the matrix output of Tfidfvectorizer in Sklearn

Answers (1)

Related Questions