Reputation: 2059
Let's say we train a model with more than 1 million words. To find the most similar words, we need to calculate the distance between the embedding of the test word and the embeddings of all 1 million words, and then find the nearest ones. Gensim seems to calculate the results very fast, yet when I compute the most similar words myself, my function is extremely slow:
import numpy as np
import operator

def euclidean_most_similars(model, word, topn=10):
    distances = {}
    vec1 = model[word]
    for item in model.wv.vocab:          # loop over every word in the vocabulary
        if item != word:
            vec2 = model[item]
            dist = np.linalg.norm(vec1 - vec2)   # Euclidean distance to the query word
            distances[(word, item)] = dist
    sorted_distances = sorted(distances.items(), key=operator.itemgetter(1))
    return sorted_distances[:topn]
I would like to know how Gensim manages to calculate the nearest words so fast, and what an efficient way to calculate the most similar words would be.
Upvotes: 3
Views: 1516
Reputation: 54173
As @g-anderson commented, the gensim source can be reviewed to see exactly what it does. However, gensim is not actually using any of its own optimized Cython or compiled-C code as part of its most_similar() method – which can be reviewed at:
Instead, by using numpy/scipy bulk array operations, those libraries' highly optimized code will take advantage of both CPU primitives and multithreading to calculate all the relevant similarities far faster than an interpreted Python loop.
(The key workhorse is the numpy dot operation: one call which computes an array of all the similarities – skipping the loop & your interim-results dict entirely. But the argsort, passing through to numpy implementations as well, likely also outperforms the idiomatic sorted().)
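Schematically, that dot-then-argsort pattern looks something like the sketch below (a simplified illustration, not gensim's actual implementation). Here unit_vectors is assumed to be a (vocab_size, dim) array of already unit-normalized word vectors, index2word maps row indices to words, and word_index is the row of the query word:

import numpy as np

def cosine_most_similar_bulk(unit_vectors, index2word, word_index, topn=10):
    # One matrix-vector dot product computes every cosine similarity at once
    # (for unit-length vectors, cosine similarity equals the dot product).
    sims = np.dot(unit_vectors, unit_vectors[word_index])
    # argsort on the negated similarities puts the highest scores first.
    best = np.argsort(-sims)
    results = [(index2word[i], float(sims[i]))
               for i in best[:topn + 1] if i != word_index]
    return results[:topn]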
Upvotes: 1