Reputation: 944
I built two word embeddings (word2vec models) using gensim and saved them as word2vec1 and word2vec2 with the model.save(model_name) command, one for each of two different corpora (the two corpora are somewhat similar, in the sense that they are related like part 1 and part 2 of a book). Suppose the top word (in terms of frequency or occurrence) in both corpora is the same word (let's call it a).
How do I compute the degree of similarity (e.g. cosine similarity via similarity()) of that extracted top word (say a) between the two word2vec models? Will most_similar() work efficiently in this case?
In other words, I want to know how similar the same word (a) is across the two generated models.
Any ideas are deeply appreciated.
Upvotes: 1
Views: 2947
Reputation: 583
You seem to have the wrong idea about word2vec. It doesn't provide one absolute vector for one word. It manages to find a representation for a word relative to other words. So, for the same corpus, if you run word2vec twice, you will get 2 different vectors for the same word. The meaning comes in when you compare it relative to other word vectors.
king - man will always be close (cosine-similarity-wise) to queen - woman, no matter how many times you train it. But the individual words will have different vectors after each training run.
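For instance, here is a minimal sketch of that idea (assuming one of your saved models contains all four words in its vocabulary; the file name is taken from your question):

from gensim.models import Word2Vec
import numpy as np

# load one of the saved models from the question
model = Word2Vec.load('word2vec1')

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# the offset vectors capture the relation between the word pairs
diff_king_man = model.wv['king'] - model.wv['man']
diff_queen_woman = model.wv['queen'] - model.wv['woman']

# this similarity tends to stay high across retrainings, even though the
# individual word vectors come out different every time you train
print(cos(diff_king_man, diff_queen_woman))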
In your case, since the 2 models are trained separately, comparing vectors of the same word is the same as comparing two random vectors. You should rather compare the relative relations. Maybe something like comparing model1.most_similar('dog') vs model2.most_similar('dog').
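One rough way to turn that comparison into a single number (just a sketch, assuming 'dog' appears in both vocabularies) is to measure how much the two top-N neighbour lists overlap:

from gensim.models import Word2Vec

model1 = Word2Vec.load('word2vec1')
model2 = Word2Vec.load('word2vec2')

# top-10 nearest neighbours of the same word in each model
neighbours1 = {w for w, _ in model1.wv.most_similar('dog', topn=10)}
neighbours2 = {w for w, _ in model2.wv.most_similar('dog', topn=10)}

# Jaccard overlap of the two neighbour sets: 1.0 means identical
# neighbourhoods, 0.0 means no shared neighbours at all
print(len(neighbours1 & neighbours2) / len(neighbours1 | neighbours2))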
However, to answer your question, if you wanted to compare the 2 vectors, you could do it as below. But the results will be meaningless.
Just take the vectors from each model and manually calculate cosine similarity.
import numpy as np

vec1 = model1.wv['computer']
vec2 = model2.wv['computer']
# cosine similarity = dot product divided by the product of the vector norms
print(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))
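Equivalently, if you have scipy installed, you can let it do the same arithmetic (a small sketch, reusing vec1 and vec2 from above; note that scipy returns a distance, not a similarity):

from scipy.spatial.distance import cosine

# scipy's cosine() is a distance, so subtract from 1 to get the similarity
print(1 - cosine(vec1, vec2))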
Upvotes: 5