Reputation: 11
I am recently working on an assignment where the task is to use 20_newgroups dataset and use 3 different vectorization technique (Bag of words, TF, TFIDF) to represent documents in vector format and then trying to analyze the difference between average cosine similarity between each class in 20_Newsgroups data set. So here is what I am trying to do in python. I am reading data and passing it to sklearn.feature_extraction.text.CountVectorizer class's fit() and transform() function for Bag of Words technique and TfidfVectorizer for TFIDF technique.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances
import numpy
import math
import csv
===============================================================================================================================================
categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.sys.mac.hardware', 'comp.windows.x','misc.forsale','rec.autos','rec.motorcycles','rec.sport.baseball','rec.sport.hockey',
'sci.crypt','sci.electronics','sci.med','sci.space','soc.religion.christian','talk.politics.guns',
'talk.politics.mideast','talk.politics.misc','talk.religion.misc']
twenty_newsgroup = fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'),shuffle=True, random_state=42)
dataset_groups = []
for group in range(0,20):
category = []
category.append(categories[group])
dataset_groups.append(fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'),shuffle=True,random_state=42,categories=category))
===============================================================================================================================================
bag_of_word_vect = CountVectorizer(stop_words='english',analyzer='word') #,min_df = 0.09
bag_of_word_vect = bag_of_word_vect.fit(twenty_newsgroup.data,twenty_newsgroup.target)
datamatrix_bow_groups = []
for group in dataset_groups:
datamatrix_bow_groups.append(bag_of_word_vect.transform(group.data))
similarity_matrix = []
for i in range(0,20):
means = []
for j in range(i,20):
result_of_group_ij = cosine_similarity(datamatrix_bow_groups[i], datamatrix_bow_groups[j])
means.append(numpy.mean(result_of_group_ij))
similarity_matrix.append(means)
===============================================================================================================================================
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
tf_vectorizer = tf_vectorizer.fit(twenty_newsgroup.data)
datamatrix_tf_groups = []
for group in dataset_groups:
datamatrix_tf_groups.append(tf_vectorizer.transform(group.data))
similarity_matrix = []
for i in range(0,20):
means = []
for j in range(i,20):
result_of_group_ij = cosine_similarity(datamatrix_tf_groups[i], datamatrix_tf_groups[j])
means.append(numpy.mean(result_of_group_ij))
similarity_matrix.append(means)
Both should technically give different similarity_matrix but they are yeilding the same. More precisiosly tf_vectorizer should create similarity_matrix which have values more closed to 1.
The problem here is, Vector created by both technique for the same document of the same class for example (alt.atheism) is different and it should be. but when I calculating a similarity score between documents of one class and another class, Cosine similarity scorer giving me same value. If we understand theoretically then TFIDF is representing a document in a more finer sense in vector space so cosine value should be more near to 1 then what I get from BAG OF WORD technique right? But it is giving same similarity score. I tried by printing values of matrices created by BOW & TFIDF technique. It would a great help if somebody can give me a good reason to resolve this issue or strong argument in support what is happening? I am new to this platform so please ignore any mistakes and let me know if you need more info.
Thanks & Regards, Darshan Sonagara
Upvotes: 1
Views: 1231
Reputation: 2471
The problem is this line in your code.
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
You have set use_idf
to False
. This means the inverse document frequency is not calculated.So only the term frequency is calculated. Basicaly you are using the TfidfVectorizer
like a CountVectorizer
. Hence the output of both is the same: resulting in the same cosine distances.
using tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=True)
Will result in a cosine similarity matrix for tfidf that is different from the countvectorizer.
Upvotes: 1