Reputation: 140
I have documents like the ones below
1 NAME LASTNAME DOB CITY
2 NAME ADDRESS CITY
3 NAME LASTNAME ADDRESS CITY
4 NAME LASTNAME ADDRESS CITY PINCODE
5 NAME ADDRESS PINCODE
and TfidfVectorizer gave the values below
   address      city      dob  employername  lastname  mobile      name  phone   pincode
0  0.000000  0.306476  0.68835           0.0  0.553393     0.0  0.354969    0.0  0.000000
1  0.573214  0.535492  0.00000           0.0  0.000000     0.0  0.620221    0.0  0.000000
2  0.412083  0.384964  0.00000           0.0  0.695116     0.0  0.445875    0.0  0.000000
3  0.357479  0.333954  0.00000           0.0  0.603009     0.0  0.386795    0.0  0.497447
4  0.493437  0.000000  0.00000           0.0  0.000000     0.0  0.533901    0.0  0.686637
From above, both documents 1 & 3 have 'name' term and also no. of terms is same in both documents, so tf(name) should be same in both cases. Also idf would be same. But why 'name' feature has different tfidf values in both documents?
Please help me understand this.
I actually have many documents and applied tfidf on all of those, given above are top 5 records of data.
Upvotes: 2
Views: 1661
Reputation: 9081
That is because norm='l2'
is the default setting. It means each row of the matrix is L2-normalized (scaled to unit Euclidean length), so all values lie between 0 and 1 and each value depends on the other terms in its document. You can turn that off by using norm=None
and then you will get the same tf-idf value for 'name' in both documents -
doc = ["NAME LASTNAME DOB CITY", "NAME ADDRESS CITY",
"NAME LASTNAME ADDRESS CITY",
"NAME LASTNAME ADDRESS CITY PINCODE", "NAME ADDRESS PINCODE"]
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(norm=None)
tf_idf = vec.fit_transform(doc)
print(vec.get_feature_names())
print(tf_idf.todense())
Output
['address', 'city', 'dob', 'lastname', 'name', 'pincode']
[[ 0. 1.18232156 2.09861229 1.40546511 1. 0. ]
[ 1.18232156 1.18232156 0. 0. 1. 0. ]
[ 1.18232156 1.18232156 0. 1.40546511 1. 0. ]
[ 1.18232156 1.18232156 0. 1.40546511 1. 1.69314718]
[ 1.18232156 0. 0. 0. 1. 1.69314718]]
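Since norm=None is used and each term appears at most once per document here, the unnormalized tf-idf of a term is simply its (smoothed) idf, which is why whole columns share the same value. As a quick sketch (my own variable names, assuming scikit-learn's default smooth_idf=True), the numbers above can be reproduced with the formula ln((1 + n) / (1 + df)) + 1 -
import math

n_docs = 5
# document frequency of each feature across the 5 example documents
df = {'address': 4, 'city': 4, 'dob': 1, 'lastname': 3, 'name': 5, 'pincode': 2}

# scikit-learn's default smooth idf: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
print(idf)
# {'address': 1.1823..., 'city': 1.1823..., 'dob': 2.0986...,
#  'lastname': 1.4054..., 'name': 1.0, 'pincode': 1.6931...}
# these match what the fitted vectorizer stores internally in vec.idf_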
P.S.: It is usually a good idea to normalize your feature matrix
UPDATE: With the L2 norm, each value in a row is divided by the square root of the sum of the squares of that row. Example - for row 1, column 4 ('lastname'), 1.40546511 is divided by the square root of the sum of squares of row 1. Here is code that shows this -
import math

# raw (unnormalized) tf-idf values of the first document
first_doc = tf_idf.todense()[0].tolist()[0]
# L2 norm = square root of the sum of squares of the row
l2 = math.sqrt(sum(i * i for i in first_doc))
print(l2)
# dividing each value by the row's L2 norm reproduces the default output
print([i / l2 for i in first_doc])
Output
2.9626660243635254
[0.0, 0.39907351927997176, 0.7083526362438907, 0.4743920160255332, 0.3375338265523302, 0.0]
Here I just manually calculated what TfidfVectorizer
would have done with norm='l2'
. Notice how all the values lie between 0 and 1. This is one of the standard techniques for normalizing data; normalization often helps algorithms converge faster and can improve accuracy. I hope this clears it up.
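If you want to check this for the whole matrix rather than a single row, a minimal sketch (reusing the doc, vec and tf_idf objects from above) is to L2-normalize the norm=None matrix with sklearn.preprocessing.normalize and compare it with the vectorizer's default output -
import numpy as np
from sklearn.preprocessing import normalize

# L2-normalize every row of the unnormalized (norm=None) matrix
manual = normalize(tf_idf, norm='l2').todense()

# the vectorizer's default behaviour (norm='l2') gives the same matrix
default = TfidfVectorizer().fit_transform(doc).todense()

print(np.allclose(manual, default))  # True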
Upvotes: 1