Nacho

Reputation: 832

Inconsistent similarity between inferred and trained vectors in doc2vec

I trained a paragraph vector model with gensim on a considerable amount of text data. Then I ran the following test: I looked up the trained vector for a sentence by its index, and also inferred a vector for the same sentence:

>>> x=m.docvecs[18638]
>>> g=m.infer_vector("The seven OxyR target sequences analyzed previously and two new sites grxA at position 207 in GenBank entry M13449 and a second Mu phage mom site at position 59 in GenBank entry V01463 were used to generate an individual information weight matrix".split())

When I computed the cosine similarity between the two vectors, it was very low, while I expected it to be high:

>>> 1 - spatial.distance.cosine(g, x)
0.20437437837633066

Can someone tell me if I'm doing something wrong, please?

Thanks

Upvotes: 1

Views: 676

Answers (2)

gojomo

Reputation: 54153

Some thoughts:

If your initial training applied any extra preprocessing to the text examples – say, case-flattening – you should apply that same preprocessing to the tokens you feed to infer_vector().
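For example, if training lower-cased the text before tokenizing (a hypothetical assumption here; `preprocess` is an illustrative helper, not a gensim function), inference should do exactly the same:

```python
def preprocess(text):
    # hypothetical training-time preprocessing: lower-case + whitespace split
    return text.lower().split()

sentence = ("The seven OxyR target sequences analyzed previously "
            "were used to generate an individual information weight matrix")
tokens = preprocess(sentence)
# tokens now look like ['the', 'seven', 'oxyr', 'target', ...],
# matching the form the model saw during training
# g = m.infer_vector(tokens)   # instead of sentence.split()
```

If the model was trained on lower-cased tokens, feeding it mixed-case tokens like 'OxyR' means those words are effectively unknown and contribute nothing to the inferred vector.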

The gensim defaults for the optional parameters of infer_vector(), including steps=5 and alpha=0.1, are wild guesses that may be insufficient for many models/training modes. Many have reported better results with much higher steps (into the hundreds), or a lower starting alpha (more like the training default of 0.025).

When the model itself returns most_similar() results, it does all of its cosine-similarity calculations on unit-length normalized doc-vectors – that is, those in the generated-when-needed model.docvecs.doctag_syn0norm array. However, the vector returned by infer_vector() will just be the raw, unnormalized inferred vector – analogous to the raw vectors in the model.docvecs.doctag_syn0 array. If computing your own cosine-similarities, be sure to account for this. (I think spatial.distance.cosine() accounts for this.)
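If you want to remove any doubt, you can unit-normalize both vectors yourself before taking the dot product; cosine similarity is then scale-invariant. A minimal NumPy sketch (the vector values are placeholders, not real doc-vectors):

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# placeholders standing in for m.docvecs[18638] and an infer_vector() result:
x = np.array([0.5, 1.0, -0.25])
g = np.array([1.0, 2.0, -0.5])   # same direction, different magnitude

cos_sim = float(np.dot(unit(x), unit(g)))
# cos_sim == 1.0 here, since g points in exactly the same direction as x
```

Computed this way, the result matches what scipy's 1 - spatial.distance.cosine(g, x) gives, since that function also normalizes internally.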

In general, re-inferring a vector for the same text that was used to train a doc-vector during bulk training should result in a very similar (but not identical) vector. So if in fact m.docvecs[18638] was for the exact same text as you're re-inferring here, the distance should be quite small. (This can be a good 'sanity check' on whether a training process and then later inference is having the desired effect.) If this expected similarity isn't achieved, you should re-check that the right preprocessing occurred during training, that the model parameters are causing real training to occur, that you're referring to the right trained vector (18638) without any off-by-one/etc errors, and so forth.

Upvotes: 0

Lenka Vraná

Reputation: 1706

The paragraph vector that was stored inside the model (m.docvecs[18638]) was created during the training phase, and the model may have changed afterwards as other paragraphs were used for training. With infer_vector(), you are using the final state of the model. You could try to minimize this difference by adding more epochs to the training phase.

However, I would recommend always using infer_vector(), so you can be sure that all your paragraph vectors were created with the same version of the model.

Upvotes: 1
