M S

Reputation: 944

Calculation of Cosine Similarity of a single word in 2 different Word2Vec Models

I built two word embedding models (word2vec) using gensim and saved them as word2vec1 and word2vec2 with the model.save(model_name) command. They were trained on two different corpora, which are somewhat similar (related in the way part 1 and part 2 of a book are). Suppose the top word (in terms of frequency of occurrence) in both corpora is the same word (call it a).

How can I compute the degree of similarity (cosine similarity) of that top word (say 'a') between the two word2vec models? Will most_similar() work efficiently in this case?

I want to know how similar the representations of the same word (a) are across the two separately trained models.

Any idea is deeply appreciated.

Upvotes: 1

Views: 2947

Answers (1)

aneesh joshi

Reputation: 583

You seem to have the wrong idea about word2vec. It doesn't provide one absolute vector for a word; it finds a representation for a word relative to other words. So even on the same corpus, if you run word2vec twice, you will get two different vectors for the same word, because training starts from a random initialization. The meaning only emerges when you compare a vector with the other word vectors from the same model.

king - man should be close (in terms of cosine similarity) to queen - woman no matter how many times you train the model, but the individual vectors will be different after each training run.
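
To illustrate why this happens, here is a toy sketch using only NumPy (no real word2vec model is involved). It assumes, as an idealization, that two training runs differ only by a rotation of the vector space; the random embeddings, the seed, and the names emb_a and emb_b are made up for the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)

# Toy "model A": 4 words with 50-dimensional vectors (random, not trained).
emb_a = rng.normal(size=(4, 50))

# Random orthogonal matrix, obtained from a QR decomposition.
q, _ = np.linalg.qr(rng.normal(size=(50, 50)))

# Toy "model B": the same geometry as model A, but rotated.
emb_b = emb_a @ q

# Similarity between word 0 and word 1 inside each model.
within_a = cosine(emb_a[0], emb_a[1])
within_b = cosine(emb_b[0], emb_b[1])

# Similarity of word 0 with itself across the two models.
across = cosine(emb_a[0], emb_b[0])

print(within_a, within_b, across)
```

The two within-model values agree up to floating-point error, because a rotation preserves angles, while the cross-model value for the "same" word is unrelated to how the word is actually used.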

In your case, since the two models are trained separately, comparing the vectors of the same word is like comparing two random vectors. You should rather compare the relative relations, maybe something like model1.wv.most_similar('dog') vs. model2.wv.most_similar('dog').
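
One possible way to turn that comparison into a number is to measure how much the two neighbour lists overlap. The sketch below is only a suggestion: the helper name neighbor_overlap and the choice of Jaccard overlap are mine, and the sample lists are invented. It expects lists of (word, score) pairs, which is the shape that gensim's most_similar returns.

```python
def neighbor_overlap(neighbors1, neighbors2):
    """Jaccard overlap between two lists of (word, score) pairs.

    Returns 1.0 when both lists contain exactly the same words,
    0.0 when they share none. Scores and ordering are ignored.
    """
    words1 = {word for word, _ in neighbors1}
    words2 = {word for word, _ in neighbors2}
    union = words1 | words2
    if not union:
        return 0.0
    return len(words1 & words2) / len(union)

# Invented neighbour lists standing in for the output of each model.
sample1 = [("cat", 0.81), ("puppy", 0.79), ("pet", 0.70)]
sample2 = [("puppy", 0.85), ("cat", 0.77), ("wolf", 0.66)]

print(neighbor_overlap(sample1, sample2))  # 2 shared words out of 4 -> 0.5
```

With real models, the two arguments would be model1.wv.most_similar('dog', topn=10) and model2.wv.most_similar('dog', topn=10). A high overlap suggests the word is used similarly in both corpora.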

However, to answer your question: if you did want to compare the two vectors directly, you could do it as below, but the result will be meaningless.

Just take the vector from each model and calculate the cosine similarity manually:

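import numpy as np
# model1 and model2 are the two trained gensim Word2Vec models
vec1 = model1.wv['computer']
vec2 = model2.wv['computer']
# cosine similarity: dot product divided by the product of the norms
print(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))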

Upvotes: 5
