Reputation: 8025
I am trying to understand how python-glove computes most-similar
terms.
Is it using cosine similarity?
Example from the python-glove GitHub repo:
https://github.com/maciejkula/glove-python/tree/master/glove
I know that gensim's word2vec most_similar
method computes similarity using cosine distance.
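For reference, this is roughly what I mean by cosine similarity (a minimal sketch with toy vectors, not gensim's actual implementation):

```python
import numpy as np

# Plain cosine similarity between two vectors -- what I understand
# gensim's most_similar to rank candidate words by.
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```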
Upvotes: 4
Views: 7451
Reputation: 1
Yes, it uses cosine similarity.
The paper mentions this in the text: "... A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity. ..."
Upvotes: -1
Reputation: 291
The project website is a bit unclear on this point:
The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
Euclidean distance is not the same as cosine similarity. It sounds like either works well enough, but the site does not specify which one is used.
However, we can look at the source of the repo you linked to see:
# dot product of the query vector with every row of the embedding
# matrix, divided by both norms -- i.e. row-wise cosine similarity
dst = (np.dot(self.word_vectors, word_vec)
       / np.linalg.norm(self.word_vectors, axis=1)
       / np.linalg.norm(word_vec))
It uses cosine similarity.
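You can check this yourself with a small sketch: the repo's vectorized expression gives the same numbers as an explicit per-row cosine-similarity computation (the variable names mirror the snippet above; the data is random toy input, not GloVe vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(5, 3))   # stand-in embedding matrix
word_vec = word_vectors[2]               # the query word's vector

# the vectorized expression from glove-python's source
dst = (np.dot(word_vectors, word_vec)
       / np.linalg.norm(word_vectors, axis=1)
       / np.linalg.norm(word_vec))

# explicit cosine similarity, one row at a time
cos = np.array([np.dot(v, word_vec)
                / (np.linalg.norm(v) * np.linalg.norm(word_vec))
                for v in word_vectors])

print(np.allclose(dst, cos))  # the two computations agree
```

Note also that dst[2] comes out as 1.0: a word is maximally cosine-similar to itself, which is why most_similar implementations typically drop the query word from the results.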
Upvotes: 2
Reputation: 3337
On the GloVe project website, this is explained with a fair amount of clarity: http://www-nlp.stanford.edu/projects/glove/
In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.
To read more about the math behind this, check the "Model overview" section on the website.
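The vector-difference idea can be sketched with toy numbers (the 2-d "embeddings" below are made up purely for illustration; real GloVe vectors are high-dimensional and only approximately satisfy this):

```python
import numpy as np

# Toy embeddings: one axis loosely encodes "royalty",
# the other "gender". All values are invented for illustration.
vectors = {
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
}

# GloVe's design goal: a vector difference isolates a relation,
# so king - man and queen - woman point in the same direction.
diff1 = vectors["king"] - vectors["man"]
diff2 = vectors["queen"] - vectors["woman"]
print(np.allclose(diff1, diff2))  # True for these toy vectors
```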
Upvotes: 1