Skinish

Reputation: 65

How to get the TF-IDF value of a specific word?

How can I get the TF-IDF value of a specific word from a fitted TfidfVectorizer? For example, my code is:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = []
docs.append("this is sentence number one")
docs.append("this is sentence number two")
vectorizer = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=True, stop_words='english', sublinear_tf=True)
sklearn_representation = vectorizer.fit_transform(docs)

Now, how can I get the TF-IDF value of "sentence" in the second sentence (docs[1])?

Upvotes: 1

Views: 2379

Answers (1)

juanpa.arrivillaga

Reputation: 95993

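You need to use the vectorizer's vocabulary_ attribute, which is a mapping of terms to feature indices. Look up the term's index there, then use it to select the matching column of the transformed matrix.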

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> docs = []
>>> docs.append("this is sentence number one")
>>> docs.append("this is sentence number two")
>>> vectorizer = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=True, stop_words='english', sublinear_tf=True)
>>> x = vectorizer.fit_transform(docs)
>>> x.todense()
matrix([[ 0.70710678,  0.70710678],
        [ 0.70710678,  0.70710678]])
>>> vectorizer.vocabulary_['sentence']
1
>>> c = vectorizer.vocabulary_['sentence']
>>> x[:,c]
<2x1 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in Compressed Sparse Row format>
>>> x[:,c].todense()
matrix([[ 0.70710678],
        [ 0.70710678]])
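
To get the single value for "sentence" in docs[1] rather than the whole column, you can index the sparse matrix with both the row and the column. The following is a minimal sketch, not the only way to do it. The helper name tfidf_of is hypothetical, and min_df is left at its default here on the assumption that newer scikit-learn releases may reject an integer min_df of 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is sentence number one",
    "this is sentence number two",
]

# Same settings as above, except min_df is left at its default (assumption:
# newer scikit-learn releases may reject the integer 0 for min_df).
vectorizer = TfidfVectorizer(norm="l2", use_idf=True, smooth_idf=True,
                             stop_words="english", sublinear_tf=True)
x = vectorizer.fit_transform(docs)


def tfidf_of(term, doc_index):
    """Hypothetical helper: TF-IDF weight of `term` in document `doc_index`.

    Returns 0.0 when the term is not in the fitted vocabulary, for example
    because it was removed as a stop word.
    """
    col = vectorizer.vocabulary_.get(term)
    if col is None:
        return 0.0
    # Row = document, column = feature index of the term.
    return float(x[doc_index, col])


value = tfidf_of("sentence", 1)
print(value)
```

This should print roughly 0.7071, matching the second row of the column shown above. A word that was dropped as a stop word, such as "this", has no entry in vocabulary_, so the helper returns 0.0 for it instead of raising a KeyError.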

Upvotes: 1
