Reputation: 1613
I'm trying to write a Python method to efficiently return the n closest words to a given word, based on their respective embedding vectors. Each vector has 200 dimensions, and there are a couple of million of them.
Here's what I have at the moment, which simply computes the cosine similarity between the target word's vector and every other word's vector. This is very, very slow:
from sklearn.metrics.pairwise import cosine_similarity  # assumed import, given the 2-D reshapes below

def n_nearest_words(word, n, word_vectors):
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, using a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
    if word_vector is not None:  # "if word_vector:" is ambiguous for a numpy array
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            key=lambda other_word: cosine_similarity(
                word_vector, word_vectors[other_word].reshape((1, -1))),
            reverse=True)
        return sorted_by_sim[1:n + 1]  # ignore first item, which should be the target word itself
    return list()
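For completeness, here's a minimal stand-in for get_word_vector (the real one tries several capitalizations, as the comment says) and a toy call with random vectors:

import numpy as np

def get_word_vector(word, word_vectors):
    # Simplified stand-in lookup: try a few capitalizations, return None if absent
    for candidate in (word, word.lower(), word.capitalize()):
        if candidate in word_vectors:
            return word_vectors[candidate]
    return None

word_vectors = {w: np.random.rand(200) for w in ("cat", "dog", "fish", "bird")}
print(n_nearest_words("cat", 2, word_vectors))  # two nearest neighbors of "cat"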
Does anybody have any better suggestions?
Upvotes: 1
Views: 173
Reputation: 41
Perhaps try storing the distance between each pair of words in a dict of dicts; that way you can look a pair up after you've seen it once instead of recomputing it.
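A minimal sketch of what I mean (assuming sklearn's cosine_similarity; the cache and helper names are just placeholders):

from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity

# _sim_cache[word_a][word_b] remembers a previously computed similarity
_sim_cache = defaultdict(dict)

def cached_similarity(word_a, word_b, word_vectors):
    # Order the pair so (a, b) and (b, a) share a single cache entry
    if word_b < word_a:
        word_a, word_b = word_b, word_a
    if word_b not in _sim_cache[word_a]:
        _sim_cache[word_a][word_b] = cosine_similarity(
            word_vectors[word_a].reshape((1, -1)),
            word_vectors[word_b].reshape((1, -1)))[0, 0]
    return _sim_cache[word_a][word_b]

Lookups for pairs you've already seen then skip the vector math entirely. Note that with millions of words you can't cache every pair, so this only pays off for pairs you actually revisit.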
Upvotes: 2