user2481422

Reputation: 868

Difference in values of tf-idf matrix using scikit-learn and hand calculation

I am playing with scikit-learn to find the tf-idf values.

I have a set of documents like:

D1 = "The sky is blue."
D2 = "The sun is bright."
D3 = "The sun in the sky is bright."

I want to create a matrix like this:

    Docs    blue        bright      sky         sun
    D1      tf-idf      0.0000000   tf-idf      0.0000000
    D2      0.0000000   tf-idf      0.0000000   tf-idf
    D3      0.0000000   tf-idf      tf-idf      tf-idf

So, my code in Python is:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

train_set = ["sky is blue", "sun is bright", "sun in the sky is bright"]
stop_words = stopwords.words('english')

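# Build a tf-idf vectorizer that drops English stop words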
transformer = TfidfVectorizer(stop_words=stop_words)

t1 = transformer.fit_transform(train_set).todense()
print(t1)

The result matrix I get is:

[[ 0.79596054  0.          0.60534851  0.        ]
 [ 0.          0.4472136   0.          0.89442719]
 [ 0.          0.57735027  0.57735027  0.57735027]]

If I do the calculation by hand, the matrix should be:

    Docs    blue        bright      sky         sun
    D1      0.2385      0.0000000   0.0880      0.0000000
    D2      0.0000000   0.0880      0.0000000   0.0880
    D3      0.0000000   0.0580      0.0580      0.0580

I am calculating, for example, blue in D1: tf = 1/2 = 0.5 and idf = log(3/1) = 0.477121255 (using a base-10 log), so tf-idf = tf * idf = 0.5 * 0.477 = 0.2385. I calculate the other tf-idf values the same way. Now I am wondering why I get different results from the hand calculation and from Python. Which one gives the correct results? Am I doing something wrong in the hand calculation, or is there something wrong in my Python code?
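
For reference, a minimal sketch of this hand calculation (base-10 log, tf normalized by document length, no smoothing), which reproduces the table above up to rounding:

import math

# Documents after stop-word removal (vocabulary order: blue, bright, sky, sun)
docs = [["sky", "blue"], ["sun", "bright"], ["sun", "sky", "bright"]]
vocab = ["blue", "bright", "sky", "sun"]
N = len(docs)

for doc in docs:
    row = []
    for term in vocab:
        tf = doc.count(term) / len(doc)    # term frequency, length-normalized
        df = sum(term in d for d in docs)  # number of documents containing the term
        idf = math.log10(N / df)           # base-10 log, no smoothing
        row.append(round(tf * idf, 4))
    print(row)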

Upvotes: 8

Views: 3042

Answers (2)

Kevin Wang

Reputation: 43

smooth_idf : boolean, default=True

A smoothed version of the idf is used. There are many variants; scikit-learn uses the following one: $1 + \log\frac{N+1}{n+1}$, where $N$ is the total number of documents and $n$ is the number of documents containing the term.

For D1 (the terms blue and sky):

tf: 1/2, 1/2
idf with smoothing: log(4/2) + 1, log(4/3) + 1
tf-idf: 1/2 * (log(4/2) + 1), 1/2 * (log(4/3) + 1)
L2 normalization: 0.79596054, 0.60534851
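
As a quick check, a sketch that plugs these numbers in (assuming the smoothed idf formula above and the natural log):

import numpy as np

N = 3                                  # total number of documents
tf = np.array([1/2, 1/2])              # tf of blue and sky in D1
df = np.array([1, 2])                  # blue appears in 1 doc, sky in 2
idf = np.log((N + 1) / (df + 1)) + 1   # smoothed idf: 1 + log((N+1)/(n+1))
tfidf = tf * idf
print(tfidf / np.linalg.norm(tfidf))   # L2-normalized -> [0.79596054 0.60534851]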

By the way, the second row of the output in the question may be wrong; its two non-zero values should be the same, and my own output from Python does give equal values there.

Upvotes: 0

lejlot

Reputation: 66805

There are two reasons:

  1. You are neglecting the smoothing that is often applied in such cases
  2. You are assuming a base-10 logarithm

According to the source, sklearn makes neither assumption.

First, it smooths the document counts (so there is never a zero):

df += int(self.smooth_idf)
n_samples += int(self.smooth_idf)

and it uses the natural logarithm (np.log(np.e) == 1):

idf = np.log(float(n_samples) / df) + 1.0

There is also a default L2 normalization applied at the end. In short, scikit-learn does a few more "nice little things" while computing tf-idf. Neither of these approaches (theirs or yours) is bad; theirs is simply more advanced.
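
To illustrate, here is a sketch that follows those source lines end to end (raw counts, smoothed df, natural log, then row-wise L2 normalization); the count matrix is written out by hand for the vocabulary blue, bright, sky, sun:

import numpy as np

# Raw term counts per document, columns: blue, bright, sky, sun
counts = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 1, 1, 1]], dtype=float)

n_samples = counts.shape[0]
df = (counts > 0).sum(axis=0)            # document frequency per term

# smoothing, as in the source above
df = df + 1
n_samples = n_samples + 1

idf = np.log(n_samples / df) + 1.0       # natural logarithm

tfidf = counts * idf                     # raw counts times idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2 normalization per row
print(tfidf)

Rows 1 and 3 match the output in the question; row 2 comes out as two equal values (about 0.7071 each), consistent with the other answer's remark about the second row.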

Upvotes: 14
