Reputation: 73
This might be a strange question, but I can't help but wonder. Say I have three documents, d1, d2 and d3. If I transform all three into TF-IDF-valued vectors, will documents d1 and d2 be closer to each other in vector space than documents d2 and d3, for example? Sorry if this is a stupid question, but I would really like to visualize this somehow in order to understand it better. Thank you in advance!
Upvotes: 0
Views: 27
Reputation: 210842
Yes, they will be closer, as long as d1 and d2 share terms that d3 does not contain.
Demo:
In [21]: from sklearn.feature_extraction.text import TfidfVectorizer
In [22]: from sklearn.metrics.pairwise import cosine_similarity
In [23]: data = ['My name is Stefan.', 'My name is David.', 'Hello, how are you?']
In [24]: tfidf = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1,3))
In [25]: r = tfidf.fit_transform(data)  # one TF-IDF row vector per document
In [26]: s = cosine_similarity(r)  # s[i, j] is the similarity of documents i and j
In [27]: s
Out[27]:
array([[1.        , 0.53634991, 0.        ],
       [0.53634991, 1.        , 0.        ],
       [0.        , 0.        , 1.        ]])
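The question asks about being "closer" in vector space, while the demo reports cosine similarity. Here is a minimal sketch that checks the same thing with Euclidean distance, reusing the three example sentences. It relies on `TfidfVectorizer` defaulting to `norm='l2'`, so every row is a unit vector and distance equals sqrt(2 - 2 * similarity). The names `docs`, `vectors`, `sim`, `dist` and `expected` are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = ['My name is Stefan.', 'My name is David.', 'Hello, how are you?']

# Same settings as the demo; the default norm='l2' makes each row unit length.
tfidf = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1, 3))
vectors = tfidf.fit_transform(docs)

sim = cosine_similarity(vectors)      # higher means more similar
dist = euclidean_distances(vectors)   # lower means closer

# For unit vectors: distance = sqrt(2 - 2 * cosine similarity).
# clip guards against tiny negative values caused by rounding.
expected = np.sqrt(np.clip(2 - 2 * sim, 0, None))

print(np.round(dist, 3))
print(dist[0, 1] < dist[1, 2])  # d1 and d2 are closer than d2 and d3
```

With these sentences the d1/d2 distance should come out at roughly 0.96 and the d2/d3 distance at roughly 1.41 (the square root of 2, the value for two unit vectors with no terms in common), so both measures give the same ordering.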
Upvotes: 2