Reputation: 71
I am trying to use the Python wrapper around Word2vec. I have a group of words, shown below, and I am trying to determine which two of them are most similar to each other.
How can I do this?
['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
Upvotes: 5
Views: 2213
Reputation: 54243
@rylan-feldspar's answer is generally the correct approach and will work, but you could do this a bit more compactly using standard Python libraries/idioms, especially itertools, a list comprehension, and sorting functions.
For example, first use combinations() from itertools to generate all pairs of your candidate words:
from itertools import combinations
candidate_words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']
all_pairs = combinations(candidate_words, 2)
Then, decorate the pairs with their pairwise similarity:
scored_pairs = [(w2v_model.wv.similarity(p[0], p[1]), p)
                for p in all_pairs]
Finally, sort to put the most-similar pair first, and report that score & pair:
sorted_pairs = sorted(scored_pairs, reverse=True)
print(sorted_pairs[0]) # first item is most-similar pair
If you wanted to be compact but a bit less readable, it could be a (long) "1-liner":
print(sorted([(w2v_model.wv.similarity(p[0], p[1]), p)
              for p in combinations(candidate_words, 2)
             ], reverse=True)[0])
Update:
Integrating @ryan-feldspar's suggestion about max(), and going for minimality, this should also work to report the best pair (but not its score):
print(max(combinations(candidate_words, 2),
          key=lambda p: w2v_model.wv.similarity(p[0], p[1])))
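If you want the max() approach to return the score as well, one option is to take the max over (score, pair) tuples. The following is only a sketch: so that it runs without a trained model, it uses made-up toy vectors and a hand-written cosine similarity (`toy_vectors`, `cosine_similarity`, `toy_similarity` and `most_similar_pair` are all hypothetical names, not part of gensim). With a real model you would pass `w2v_model.wv.similarity` as the similarity function instead.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy vectors, invented purely for illustration.
# Real vectors would come from a trained model.
toy_vectors = {
    'architect': [0.9, 0.1, 0.0],
    'nurse': [0.1, 0.9, 0.2],
    'surgeon': [0.2, 0.9, 0.1],
    'grandmother': [0.0, 0.2, 0.9],
    'dad': [0.3, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_similarity(w1, w2):
    """Stand-in for w2v_model.wv.similarity(w1, w2)."""
    return cosine_similarity(toy_vectors[w1], toy_vectors[w2])


def most_similar_pair(words, similarity):
    """Return (score, (word1, word2)) for the highest-scoring pair."""
    return max(
        ((similarity(w1, w2), (w1, w2)) for w1, w2 in combinations(words, 2)),
        key=lambda scored: scored[0],  # compare on the score only
    )


best_score, best_pair = most_similar_pair(list(toy_vectors), toy_similarity)
print(best_pair, round(best_score, 3))
```

Passing the similarity function in as an argument keeps the pair-selection logic independent of where the vectors come from.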
Upvotes: 3
Reputation: 614
Given you're using gensim's word2vec, according to your comment:
Load up or train the model for your embeddings and then, on your model, you can call:
# Assumes `model` is your loaded or trained gensim Word2Vec model.
words = ['architect', 'nurse', 'surgeon', 'grandmother', 'dad']

min_distance = float('inf')
min_pair = None
word2vec_model_wv = model.wv  # look up the keyed vectors once, outside the loops
for candidate_word1 in words:
    for candidate_word2 in words:
        if candidate_word1 == candidate_word2:
            continue  # ignore when the two words are the same
        distance = word2vec_model_wv.distance(candidate_word1, candidate_word2)
        if distance < min_distance:
            min_pair = (candidate_word1, candidate_word2)
            min_distance = distance
You could also use similarity instead: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity
In gensim, distance(w1, w2) is computed as 1 - similarity(w1, w2), so similarity gets bigger for closer words. With similarity you therefore maximize rather than minimize: replace the distance calls with similarity calls, start from float('-inf'), and flip the comparison to >. Basically this is just a simple min/max over the pairs.
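As a quick check that minimizing distance and maximizing similarity select the same pair, here is a small sketch. It assumes distance is defined as 1 - similarity (which is how gensim documents KeyedVectors.distance) and uses invented scores rather than a real model; `toy_scores`, `similarity` and `distance` are hypothetical stand-ins for the model's methods.

```python
from itertools import combinations

words = ['nurse', 'surgeon', 'dad']

# Hypothetical similarity scores, invented for illustration only.
toy_scores = {
    frozenset(('nurse', 'surgeon')): 0.80,
    frozenset(('nurse', 'dad')): 0.30,
    frozenset(('surgeon', 'dad')): 0.25,
}


def similarity(w1, w2):
    """Stand-in for model.wv.similarity(w1, w2)."""
    return toy_scores[frozenset((w1, w2))]


def distance(w1, w2):
    """Stand-in for model.wv.distance(w1, w2), assumed to be 1 - similarity."""
    return 1.0 - similarity(w1, w2)


# combinations() yields each unordered pair once and never pairs a word
# with itself, so no explicit "same word" check is needed.
closest = min(combinations(words, 2), key=lambda p: distance(*p))
most_similar = max(combinations(words, 2), key=lambda p: similarity(*p))

print(closest, most_similar)
```

Both selections agree because subtracting from a constant reverses the ordering of the scores without changing which pair sits at the extreme.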
Upvotes: 2