How do similar documents transformed into TF-IDF vectors look in vector space?

Question

This might be a strange question, but I can't help wondering. Let's say I have three documents:

d1 = "My name is Stefan."
d2 = "My name is David."
d3 = "Hello, how are you?"

If I transform all three documents into TF-IDF vectors, will d1 and d2 be closer to each other in vector space than, for example, d2 and d3? Sorry if this is a stupid question, but I would really like to visualize this somehow in order to understand it better. Thank you in advance!

MaxU - stand with Ukraine · Accepted Answer

Yes, they will be closer.

Demo:

In [21]: from sklearn.feature_extraction.text import TfidfVectorizer

In [22]: from sklearn.metrics.pairwise import cosine_similarity

In [23]: tfidf = TfidfVectorizer(max_features=50000, use_idf=True, ngram_range=(1,3))

In [24]: r = tfidf.fit_transform(data)

In [25]: s = cosine_similarity(r)

In [26]: s
Out[26]:
array([[1.        , 0.53634991, 0.        ],
       [0.53634991, 1.        , 0.        ],
       [0.        , 0.        , 1.        ]])

In [27]: data
Out[27]: ['My name is Stefan.', 'My name is David.', 'Hello, how are you?']
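To see why the first two documents end up close together, it can help to look at which terms each vector actually has weight on. The following is only a minimal sketch and differs from the session above: it assumes the default `TfidfVectorizer()` settings (unigrams only, L2-normalized rows) rather than `ngram_range=(1,3)`, so the similarity values will not match the 0.536 shown there. The variable names (`vec`, `terms`, `shared`, `dist`) are just illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

data = ['My name is Stefan.', 'My name is David.', 'Hello, how are you?']

# Assumption: default settings (lowercased unigrams, L2-normalized rows).
vec = TfidfVectorizer()
X = vec.fit_transform(data).toarray()

# Vocabulary terms ordered by their column index in X.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Show the non-zero dimensions of each document vector.
for doc, row in zip(data, X):
    weights = {t: round(float(w), 3) for t, w in zip(terms, row) if w > 0}
    print(doc, '->', weights)

# Terms on which both d1 and d2 have non-zero weight.
shared = {t for t, a, b in zip(terms, X[0], X[1]) if a > 0 and b > 0}
print('shared by d1 and d2:', sorted(shared))

# Pairwise Euclidean distances between the document vectors.
dist = euclidean_distances(X)
print(dist.round(3))
```

d1 and d2 both have weight on "my", "name" and "is", so their vectors point in partly the same direction. d3 shares no term with either of them, so it is orthogonal to both (cosine similarity 0), and because the rows are unit length its Euclidean distance to each of them is sqrt(2), the largest distance two such non-negative unit vectors can have.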
