Reputation: 2059
Let's say we train a model with more than 1 million words. To find the most similar words, we need to calculate the distance between the embedding of the test word and the embeddings of all 1 million words, and then find the nearest ones. Gensim seems to calculate the results very fast, yet when I compute the most similar words myself, my function is extremely slow:
import numpy as np
import operator

def euclidean_most_similars(model, word, topn=10):
    distances = {}
    vec1 = model[word]
    for item in model.wv.vocab:          # loop over every word in the vocabulary
        if item != word:
            vec2 = model[item]
            dist = np.linalg.norm(vec1 - vec2)   # Euclidean distance to the query word
            distances[(word, item)] = dist
    sorted_distances = sorted(distances.items(), key=operator.itemgetter(1))
    return sorted_distances[:topn]
I would like to know how Gensim manages to calculate the nearest words so fast, and what an efficient way to calculate the most similar words would be.
Upvotes: 3
Views: 1516
Reputation: 54173
As @g-anderson commented, the gensim source can be reviewed to see exactly what it does. However, gensim is not actually using any of its own optimized Cython or compiled-C code as part of its most_similar() method – which can be reviewed at:
Instead, by using numpy/scipy bulk array operations, those libraries' highly optimized code will take advantage of both CPU primitives and multithreading to calculate all the relevant similarities far faster than an interpreted Python loop.
(The key workhorse is the numpy dot operation: one call which computes an array of all the similarities – skipping the loop & your interim-results dict entirely. But the argsort, passing through to numpy implementations as well, likely also outperforms the idiomatic sorted().)
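Schematically, that dot-then-argsort pattern looks something like the sketch below (a simplified illustration, not gensim's actual implementation). Here unit_vectors is assumed to be a (vocab_size, dim) array of already unit-normalized word vectors, index2word maps row indices to words, and word_index is the row of the query word:

import numpy as np

def cosine_most_similar_bulk(unit_vectors, index2word, word_index, topn=10):
    # One matrix-vector dot product computes every cosine similarity at once
    # (for unit-length vectors, cosine similarity equals the dot product).
    sims = np.dot(unit_vectors, unit_vectors[word_index])
    # argsort on the negated similarities puts the highest scores first.
    best = np.argsort(-sims)
    results = [(index2word[i], float(sims[i]))
               for i in best[:topn + 1] if i != word_index]
    return results[:topn]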
Upvotes: 1