Reputation: 498
model.similar_by_vector(model['king'] - model['man'] + model['woman'], topn=1)[0]
Results in
('king', 0.8551837205886841)
Whereas
model.most_similar(positive=['king', 'queen'], negative=['man'], topn=1)[0]
Gives a different answer (the one you'd expect)
('monarch', 0.6350384950637817)
But I'd expect both of these to return the same thing. Am I misunderstanding how vector math should be performed on these vectors?
Upvotes: 1
Views: 558
Reputation: 54243
You can look at the source code for the most_similar() (and similar_by_vector()) methods if you'd like to review closely how what they're doing differs from what you might expect, for example by browsing the project's source repository online.
In particular, there are likely a couple of factors at play in the discrepancy you're seeing:
When you supply look-up keys (word-tokens) to most_similar(), it will disqualify those same keys as answers, on the assumption that you want an answer that's not already in your passed-in parameters. That is, even if the target location is closest to 'king', if 'king' was one of the supplied keys, it will be ignored as a possible ranked response.
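For instance, a minimal sketch of that filtering (the vectors file path is a placeholder, and I've used positive=['king', 'woman'] so both calls express the same king - man + woman analogy):

from gensim.models import KeyedVectors

# placeholder path -- substitute whatever word2vec-format vectors you loaded
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

vec = model['king'] - model['man'] + model['woman']

# similar_by_vector() has no idea which words built this vector, so the
# input word 'king' itself is free to rank first
print(model.similar_by_vector(vec, topn=3))

# most_similar() knows its input keys and silently drops them from the
# ranked results, so 'king' can never be returned
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))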
most_similar() uses the unit-length normalized version of each input word's vector (via word_vec(word, use_norm=True)) for the lookup, whereas a bracket lookup (like model[word]) returns the raw, non-normalized vector.
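Putting both factors together, here's a hedged sketch (gensim 3.x API, matching the word_vec() call above; the file path is again a placeholder) that reproduces most_similar()'s answer by hand:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# use the unit-normalized vector for each word, as most_similar() does internally
target = (model.word_vec('king', use_norm=True)
          - model.word_vec('man', use_norm=True)
          + model.word_vec('woman', use_norm=True))

# rank all words against the combined vector, then drop the input keys,
# mirroring most_similar()'s filtering step
candidates = model.similar_by_vector(target, topn=10)
print([(w, s) for w, s in candidates if w not in {'king', 'man', 'woman'}][0])

Note that most_similar() averages the normalized input vectors rather than summing them, but since cosine similarity is scale-invariant, the ranking comes out the same after the internal re-normalization.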
Upvotes: 1