Reputation:
I am using the following Python code to generate a similarity matrix of word vectors (my vocabulary size is 77).
similarity_matrix = []
index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0))
for sims in index:
    similarity_matrix.append(sims)
similarity_array = np.array(similarity_matrix)
The dimensionality of similarity_array is 300 x 300. However, as I understand it, the dimensionality should be 77 x 77 (as my vocabulary size is 77).
i.e.,
        word1  word2  ...  word77
word1   0.2    0.8    ...  0.9
word2   0.1    0.2    ...  1.0
...     ...    ...    ...  ...
word77  0.9    0.8    ...  0.1
Please let me know what is wrong with my code.
Moreover, what is the order of the vocabulary (word1, word2, ..., word77) used to calculate this similarity matrix? Can I obtain this order from model.wv.index2word?
Please help me!
Upvotes: 3
Views: 4509
Reputation: 41
It has been a long time since this question was posted, but maybe my answer will be of help.
The code below gives the same results as index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0.T)) followed by the for loop, but is more concise.
import numpy as np
similarity_matrix = np.dot(model.wv.syn0norm, model.wv.syn0norm.T)
It calculates the dot product between normalized word vectors, i.e. the cosine similarity of each pair of words.
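For anyone who wants to check the idea without a trained model, here is a minimal numpy-only sketch. The random `vectors` array and the `words` list are stand-ins I made up for model.wv.syn0 and model.wv.index2word. In the gensim versions that expose syn0norm, I believe it is only filled in after model.wv.init_sims() has been called, so the sketch normalizes the rows by hand instead. Row i of the matrix belongs to the i-th entry of the word list, which is also how the rows of syn0 line up with index2word.

```python
import numpy as np

# Stand-in for model.wv.syn0: 77 words x 300 dimensions (random data, illustration only).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(77, 300))

# Stand-in for model.wv.index2word: row i of `vectors` is the vector of words[i].
words = ["word%d" % (i + 1) for i in range(77)]

# L2-normalize every row, which is what syn0norm is expected to hold.
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities.
similarity_matrix = unit_vectors @ unit_vectors.T

print(similarity_matrix.shape)  # (77, 77)

# Look up the similarity of two words through their positions in the word list.
i, j = words.index("word1"), words.index("word2")
print(words[i], words[j], similarity_matrix[i, j])
```

The diagonal is 1.0 (each word compared with itself) and the matrix is symmetric, which is a quick sanity check on real data as well.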
Upvotes: 3
Reputation: 1334
Try replacing
index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0))
with
index = gensim.similarities.MatrixSimilarity(gensim.matutils.Dense2Corpus(model.wv.syn0.T))
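The reason the transpose matters, as far as I can tell, is that Dense2Corpus reads each column of the dense matrix as one document by default (documents_columns=True). A 77 x 300 syn0 is therefore seen as 300 documents with 77 features each, hence the 300 x 300 result. The sketch below imitates that column-wise reading with plain numpy. `cosine_matrix_of_columns` is a hypothetical helper written for this illustration, not a gensim function, and the data is random.

```python
import numpy as np

def cosine_matrix_of_columns(dense):
    """Cosine similarity between the columns of `dense` (one column = one document).

    Hypothetical helper that mimics the column-wise reading; not part of gensim.
    """
    docs = dense.T  # one row per document
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ docs.T

# Stand-in for model.wv.syn0: 77 words x 300 dimensions.
rng = np.random.default_rng(0)
syn0_like = rng.normal(size=(77, 300))

# Columns are the 300 dimensions, so they get compared with each other.
print(cosine_matrix_of_columns(syn0_like).shape)    # (300, 300)

# After transposing, the columns are the 77 words.
print(cosine_matrix_of_columns(syn0_like.T).shape)  # (77, 77)
```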
Upvotes: 4