Use Word2vec to determine which two words in a group of words are most similar

Question

I am trying to use the Python wrapper around Word2vec. I have a group of words, shown below, and I want to determine which two of them are most similar to each other.

How can I do this?

['architect', 'nurse', 'surgeon', 'grandmother', 'dad']

gojomo · Accepted Answer

@rylan-feldspar's answer is the right general approach and will work, but you can do this more compactly with standard Python libraries and idioms: itertools, a list comprehension, and the built-in sorting functions.

For example, first use combinations() from itertools to generate all pairs of your candidate words:

from itertools import combinations
candidate_words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
all_pairs = combinations(candidate_words, 2)
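As a quick sanity check on that step, the sketch below materializes the pairs into a list so they can be counted and inspected. Keep in mind that combinations() returns a one-shot iterator, so wrapping it in list() is only needed if you want to loop over the pairs more than once.

```python
from itertools import combinations

candidate_words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']

# combinations() yields each unordered pair once; list() keeps them around
all_pairs = list(combinations(candidate_words, 2))

print(len(all_pairs))  # 5 words give 5 * 4 / 2 = 10 pairs
print(all_pairs[0])    # ('architect', 'nurse')
```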

Then, decorate each pair with its similarity score (w2v_model is assumed to be your already-trained model):

scored_pairs = [(w2v_model.wv.similarity(p[0], p[1]), p)
                for p in all_pairs]

Finally, sort to put the most-similar pair first, and report that score & pair:

sorted_pairs = sorted(scored_pairs, reverse=True)
print(sorted_pairs[0])  # first item is most-similar pair
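To try the three steps end to end without loading a trained model, here is a minimal sketch that swaps in a stand-in scorer. The toy_vectors values and the toy_similarity() helper are hypothetical, invented purely for illustration; with a real model you would call w2v_model.wv.similarity instead.

```python
from itertools import combinations
from math import sqrt

# Hypothetical 2-d vectors, made up for this demo (not real embeddings)
toy_vectors = {
    'architect':   (1.0, 0.0),
    'nurse':       (0.6, 0.8),
    'surgeon':     (0.8, 0.6),
    'grandmother': (0.0, 1.0),
    'dad':         (-0.6, 0.8),
}

def toy_similarity(w1, w2):
    """Cosine similarity of two toy vectors (stand-in for wv.similarity)."""
    a, b = toy_vectors[w1], toy_vectors[w2]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

candidate_words = list(toy_vectors)

# Step 1: all unordered pairs; step 2: attach a score; step 3: sort
scored_pairs = [(toy_similarity(a, b), (a, b))
                for a, b in combinations(candidate_words, 2)]
best_score, best_pair = sorted(scored_pairs, reverse=True)[0]

print(best_pair)  # ('nurse', 'surgeon')
```

Putting the score first in each tuple is what lets a plain sorted() rank the pairs by similarity.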

If you prefer compactness over readability, the same thing fits in a (long) one-liner:

print(sorted([(w2v_model.wv.similarity(p[0], p[1]), p) 
              for p in combinations(candidate_words, 2)
             ], reverse=True)[0])

Update:

Integrating @rylan-feldspar's suggestion about max(), and going for minimality, this should also work to report the best pair (but not its score):

print(max(combinations(candidate_words, 2),
          key=lambda p: w2v_model.wv.similarity(p[0], p[1])))
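If you want max() to report the score as well, one option is to take the max over (score, pair) tuples rather than using a key function. The sketch below wraps that in a hypothetical helper, most_similar_pair(), which accepts any two-argument scoring callable; the made_up_scores table and fake_similarity() are invented stand-ins so the snippet runs without a model.

```python
from itertools import combinations

def most_similar_pair(words, similarity):
    """Return (score, (w1, w2)) for the highest-scoring unordered pair.

    `similarity` is any callable taking two words, for example
    w2v_model.wv.similarity from a trained model.
    """
    return max((similarity(a, b), (a, b))
               for a, b in combinations(words, 2))

# Stand-in scorer for demonstration only: made-up scores, 0.0 otherwise
made_up_scores = {('nurse', 'surgeon'): 0.9, ('grandmother', 'dad'): 0.7}

def fake_similarity(a, b):
    return made_up_scores.get((a, b), 0.0)

candidate_words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
score, pair = most_similar_pair(candidate_words, fake_similarity)
print(score, pair)  # 0.9 ('nurse', 'surgeon')
```

Compared with sorting, max() makes a single pass and avoids building the full sorted list, which may matter for larger groups of words.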
