Mike S

Reputation: 1613

Python - Efficiently find n nearest vectors

I'm trying to write a Python method that efficiently returns the n closest words to a given word, based on their respective embedding vectors. Each vector has 200 dimensions, and there are a couple million of them.

Here's what I have at the moment, which simply does a cosine similarity comparison between the target word and every other word. This is very, very slow:

from sklearn.metrics.pairwise import cosine_similarity

def n_nearest_words(word, n, word_vectors):
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, using a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
    if word_vector is not None:  # truth-testing a NumPy array raises ValueError
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            # cosine_similarity returns a 1x1 array; take the scalar for the sort key
            key=lambda other_word: cosine_similarity(
                word_vector, word_vectors[other_word].reshape((1, -1)))[0, 0],
            reverse=True)
        return sorted_by_sim[1:n + 1]  # skip first item, which should be the target word itself
    return list()

Does anybody have any better suggestions?

Upvotes: 1

Views: 173

Answers (1)

Meghan

Reputation: 41

Perhaps try storing the distance between two words in a dict of dicts; that way you can look up a pair's distance after you've computed it once.
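A minimal sketch of that idea, assuming vectors are NumPy arrays and using a plain dot-product cosine similarity (the function names here are illustrative, not from the question's code):

```python
import numpy as np

def make_cached_similarity(word_vectors):
    """Return a cosine-similarity function that memoizes results in a
    dict of dicts, so each word pair is only computed once."""
    cache = {}

    def similarity(a, b):
        # Check both orderings of the pair before computing anything.
        if a in cache and b in cache[a]:
            return cache[a][b]
        if b in cache and a in cache[b]:
            return cache[b][a]
        va, vb = word_vectors[a], word_vectors[b]
        sim = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
        cache.setdefault(a, {})[b] = sim
        return sim

    return similarity
```

Note that with a couple million words the full pairwise cache would be far too large to hold in memory, so this only helps if you query the same pairs repeatedly; it doesn't reduce the cost of the first full scan.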

Upvotes: 2
