Reputation: 4209
I was reading this answer, which says the following about Gensim's most_similar:
it performs vector arithmetic: adding the positive vectors, subtracting the negative, then from that resulting position, listing the known-vectors closest to that angle.
But when I tested it, that did not seem to be the case. I trained a Word2Vec model with Gensim on the "text8" dataset and tested these two calls:
model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.7131118178367615), ('prince', 0.6359186768531799),...]
model.wv.most_similar([model["king"] + model["woman"] - model["man"]])
>>> [('king', 0.84305739402771), ('queen', 0.7326322793960571),...]
They are clearly not the same. Even the score for queen differs: 0.713 in the first and 0.732 in the second.
So I ask the question again: how does Gensim's most_similar work? Why are the two results above different?
Upvotes: 4
Views: 3386
Reputation: 54243
The adding and subtracting isn't all that it does; for an exact description, you should look at the source code.
You'll see there that the addition and subtraction is on the unit-normed version of each vector, via the get_vector(key, use_norm=True) accessor.
If you change your use of model[key] to model.get_vector(key, use_norm=True), you should see your outside-the-method calculation of the target vector give the same results as letting the method combine the positive and negative vectors.
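To see why the normalization step alone can shift the score, here is a minimal NumPy sketch. It does not use Gensim: the vectors are made-up toy values, and the helpers unit and cosine are hypothetical names for illustration only. It compares the cosine similarity to "queen" when the target is built from raw vectors versus from unit-normed vectors.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1 (unit norm)."""
    return v / np.linalg.norm(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors with different lengths; these are invented values,
# not output from a trained model.
vectors = {
    "king": np.array([4.0, 1.0, 0.5]),
    "woman": np.array([0.5, 2.0, 0.2]),
    "man": np.array([1.0, 0.2, 0.1]),
    "queen": np.array([1.5, 2.5, 0.4]),
}

# Target built from the raw vectors, as in model["king"] + model["woman"] - model["man"].
raw_target = vectors["king"] + vectors["woman"] - vectors["man"]

# Target built from unit-normed vectors, which is what the answer says the method does.
normed_target = unit(vectors["king"]) + unit(vectors["woman"]) - unit(vectors["man"])

raw_score = cosine(raw_target, vectors["queen"])
normed_score = cosine(normed_target, vectors["queen"])

print("raw target vs queen:   ", raw_score)
print("normed target vs queen:", normed_score)
```

The two scores differ because longer raw vectors pull the sum toward their own direction, while unit-norming gives every input equal weight. Separately, as far as I know, most_similar only filters the input words out of its results when they are passed as keys, so a call with a precomputed vector may still list king even once the scores match.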
Upvotes: 6