whenitrains

Reputation: 498

Why does passing 'positive' and 'negative' parameters into gensim's most_similar function not return the same result as the vector math?

model.similar_by_vector(model['king'] - model['man'] + model['woman'], topn=1)[0]

Results in

('king', 0.8551837205886841)

Whereas

model.most_similar(positive=['king', 'queen'], negative=['man'], topn=1)[0]

Gives a different answer (the one you'd expect):

('monarch', 0.6350384950637817)

But I'd expect both of these to return the same thing. Am I misunderstanding how vector math should be performed on these vectors?

Upvotes: 1

Views: 558

Answers (1)

gojomo

Reputation: 54243

You can look at the source code for the most_similar() (and similar_by_vector()) methods if you'd like to review closely how what they do differs from what you might expect, for example by browsing the project's source repository online:

https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L491

https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L598

In particular, there are likely a couple factors at play in the discrepancy you're seeing:

  • When you supply look-up keys (word-tokens) to most_similar(), it will disqualify those same keys as answers, on the assumption that you want a result that isn't already among your passed-in parameters. That is, even if the target location is closest to 'king', if 'king' was one of the supplied keys, it will be ignored as a possible ranked response.

  • most_similar() uses the unit-length normalized versions of each input word (via word_vec(word, use_norm=True)) for lookup, whereas a bracket-lookup (like model[word]) uses the raw, non-normalized vectors. (A sketch reproducing both behaviors follows below.)
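To make the difference concrete, here is a minimal sketch, not the gensim source itself, that approximates what most_similar() does: it combines the unit-normalized vectors of the inputs and then skips the input keys when ranking. It assumes a gensim 3.x-style KeyedVectors where word_vec(word, use_norm=True) returns the unit-length vector; the helper name manual_most_similar and the model path are hypothetical.

    import numpy as np
    from gensim.models import KeyedVectors

    # Hypothetical path; substitute whatever vectors you are actually loading.
    # model = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

    def manual_most_similar(model, positive, negative=(), topn=1):
        """Hypothetical helper approximating most_similar()'s behavior."""
        # most_similar() works with *unit-normalized* input vectors:
        # positive words add, negative words subtract.
        target = np.sum([model.word_vec(w, use_norm=True) for w in positive], axis=0)
        for w in negative:
            target = target - model.word_vec(w, use_norm=True)
        target = target / np.linalg.norm(target)

        # ...and it excludes the input keys themselves from the ranked results.
        exclude = set(positive) | set(negative)
        candidates = model.similar_by_vector(target, topn=topn + len(exclude))
        return [(w, sim) for w, sim in candidates if w not in exclude][:topn]

    # Should closely match:
    #   model.most_similar(positive=['king', 'queen'], negative=['man'], topn=1)
    # print(manual_most_similar(model, ['king', 'queen'], ['man'], topn=1))

    # By contrast, a bracket lookup returns the raw (not unit-length) vector,
    # so np.linalg.norm(model['king']) is generally not 1.0.

If you drop either the normalization or the exclusion of input keys from this sketch, you get behavior much closer to your similar_by_vector() call, which is why the two approaches in the question return different results.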

Upvotes: 1
