My3

Reputation: 140

TfidfVectorizer giving wrong results

I have documents like the ones below

1             NAME LASTNAME DOB CITY
2                     NAME ADDRESS CITY
3            NAME LASTNAME ADDRESS CITY
4    NAME LASTNAME ADDRESS CITY PINCODE
5                  NAME ADDRESS PINCODE

and TfidfVectorizer gave the values below

   address      city      dob  employername  lastname  mobile      name  phone   pincode
0  0.000000  0.306476  0.68835           0.0  0.553393     0.0  0.354969    0.0  0.000000
1  0.573214  0.535492  0.00000           0.0  0.000000     0.0  0.620221    0.0  0.000000
2  0.412083  0.384964  0.00000           0.0  0.695116     0.0  0.445875    0.0  0.000000
3  0.357479  0.333954  0.00000           0.0  0.603009     0.0  0.386795    0.0  0.497447
4  0.493437  0.000000  0.00000           0.0  0.000000     0.0  0.533901    0.0  0.686637

From the above, documents 1 and 3 both contain the term 'name', and the number of terms is the same in both documents, so tf(name) should be the same in both cases. The idf would also be the same. So why does the 'name' feature have different tf-idf values in the two documents?

Please help me understand this.

I actually have many documents and applied tf-idf to all of them; the rows above are the top 5 records of the data.

Upvotes: 2

Views: 1661

Answers (1)

Vivek Kalyanarangan

Reputation: 9081

That is because norm='l2' is the default setting. It L2-normalizes the matrix row by row, so each document vector has unit Euclidean length and all values lie between 0 and 1.
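You can check this directly - a minimal sketch using the five documents from the question (the numbers will differ from the table above, which was fit on the full dataset) -

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["NAME LASTNAME DOB CITY", "NAME ADDRESS CITY",
       "NAME LASTNAME ADDRESS CITY",
       "NAME LASTNAME ADDRESS CITY PINCODE", "NAME ADDRESS PINCODE"]

# the default settings include norm='l2'
tf_idf = TfidfVectorizer().fit_transform(doc)

# every row is scaled to unit Euclidean length, so a term's value
# depends on which other terms appear in the same document
print(np.linalg.norm(tf_idf.toarray(), axis=1))  # [1. 1. 1. 1. 1.]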

You can turn that off by passing norm=None, and you will get the raw tf-idf values -

doc = ["NAME LASTNAME DOB CITY", "NAME ADDRESS CITY", 
       "NAME LASTNAME ADDRESS CITY", 
       "NAME LASTNAME ADDRESS CITY PINCODE", "NAME ADDRESS PINCODE"]

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(norm=None)
tf_idf = vec.fit_transform(doc)

print(vec.get_feature_names())
print(tf_idf.todense())

Output

['address' 'city' 'dob' 'lastname' 'name' 'pincode']
[[ 0.          1.18232156  2.09861229  1.40546511  1.          0.        ]
 [ 1.18232156  1.18232156  0.          0.          1.          0.        ]
 [ 1.18232156  1.18232156  0.          1.40546511  1.          0.        ]
 [ 1.18232156  1.18232156  0.          1.40546511  1.          1.69314718]
 [ 1.18232156  0.          0.          0.          1.          1.69314718]]
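These raw values are just tf × idf. With scikit-learn's default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. Since every term occurs at most once per document here, the raw values are exactly the idf weights. A quick check that reproduces the matrix entries above (a sketch, assuming those defaults) -

import math

n = 5  # number of documents
df = {'address': 4, 'city': 4, 'dob': 1, 'lastname': 3, 'name': 5, 'pincode': 2}

# scikit-learn's smoothed idf: ln((1 + n) / (1 + df)) + 1
for term, d in df.items():
    print(term, math.log((1 + n) / (1 + d)) + 1)

# the fitted vectorizer exposes the same weights
print(vec.idf_)

Since 'name' appears in all five documents, its idf is ln(6/6) + 1 = 1, which is exactly the value in the 'name' column above.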

P.S.: It is generally better to normalize your feature matrix

UPDATE: With the L2 norm, each entry is divided by the square root of the sum of squares of its row. For example, the entry at row 1, column 4 (1.40546511) is divided by the square root of the sum of squares of row 1. Here is code that shows this -

import math

# first (unnormalized) document vector
first_doc = tf_idf.todense()[0].tolist()[0]
# L2 norm = square root of the sum of squares of the row
l2 = math.sqrt(sum(i * i for i in first_doc))
print(l2)
# divide each entry by the row's L2 norm
print([i / l2 for i in first_doc])

Output

2.9626660243635254
[0.0, 0.39907351927997176, 0.7083526362438907, 0.4743920160255332, 0.3375338265523302, 0.0]

In this case I just manually calculated what TfidfVectorizer does with norm='l2'. Notice how all values lie between 0 and 1. (The numbers differ from the table in the question because that matrix was fit on the full dataset, which has a larger vocabulary and different idf values.) L2 scaling is one common technique to normalize your data, and normalization helps many algorithms converge more quickly and can improve accuracy. I hope this clears it up.
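If you want the normalized matrix without redoing the division by hand, sklearn.preprocessing.normalize applies the same row-wise L2 scaling (a sketch, reusing tf_idf from above) -

from sklearn.preprocessing import normalize

# row-wise L2 scaling reproduces what TfidfVectorizer(norm='l2') returns
print(normalize(tf_idf, norm='l2').todense()[0])

This matches the first row you would get from TfidfVectorizer with its default settings.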

Upvotes: 1
