marlon

Reputation: 7633

How to speed up word2vec similarity calculation?

I trained a Word2Vec model using Gensim, and I have two sets of words:

S1 = {'','','' ...}
S2 = {'','','' ...}

For each word w1 in S1, I want to find the top 5 words in S2 that are most similar to w1. I am currently doing it this way:

model = w2v_model
word_similarities = {}
for w1 in S1:
    similarities = {}
    for w2 in S2:
        if w1 in model.wv and w2 in model.wv:
            # one pairwise cosine similarity per (w1, w2) pair
            similarities[w2] = model.wv.similarity(w1, w2)
    word_similarities[w1] = similarities

Then for each word in word_similarities, I can take the top N entries by similarity value. When S1 and S2 are large, this becomes very slow.

Is there a quicker way to compute similarities over large sets of word pairs in Word2Vec, either in gensim or TensorFlow?

Upvotes: 0

Views: 584

Answers (1)

gojomo

Reputation: 54153

Depending on the relative sizes of your model, S1, & S2, you may want to use the most_similar() method of gensim's various word-vector classes – which uses bulk, optimized vector-comparison operations to check against all vectors in your model – then filter the results down to just the words in S2.
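A minimal sketch of that first approach, assuming gensim 4.x and the w2v_model, S1, and S2 names from the question:

s2 = set(S2)
word_similarities = {}
for w1 in S1:
    if w1 not in w2v_model.wv:
        continue
    # one bulk, vectorized comparison of w1 against every word in the
    # model, returning (word, similarity) pairs sorted best-first
    all_sims = w2v_model.wv.most_similar(w1, topn=len(w2v_model.wv))
    # keep only candidates that are in S2, then take the 5 best
    in_s2 = [(w2, sim) for w2, sim in all_sims if w2 in s2]
    word_similarities[w1] = in_s2[:5]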

Alternatively, if S2 is much smaller than the full vocabulary of model.wv, and especially if you'll be re-using the same S2 set of word-vectors many times, you could create your own KeyedVectors instance containing just the S2 words: first create an empty KeyedVectors, add all the S2 words' vectors to it, then use s2.most_similar(positive=[target_word_vector], topn=5).
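A sketch of that second approach, again assuming the gensim 4.x API (KeyedVectors(vector_size) plus add_vectors()); the s2_kv name is just illustrative:

from gensim.models import KeyedVectors

# build a small KeyedVectors holding only the S2 words the model knows
s2_words = [w for w in S2 if w in w2v_model.wv]
s2_kv = KeyedVectors(vector_size=w2v_model.wv.vector_size)
s2_kv.add_vectors(s2_words, [w2v_model.wv[w] for w in s2_words])

# each query now compares against only len(S2) vectors, not the full vocab
word_similarities = {}
for w1 in S1:
    if w1 in w2v_model.wv:
        word_similarities[w1] = s2_kv.most_similar(
            positive=[w2v_model.wv[w1]], topn=5)

Note that if w1 is itself in S2, it will show up as its own top hit (similarity 1.0), so you may want to request topn=6 and drop it from the results.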

Upvotes: 1
