Nacho

Reputation: 832

Inconsistent similarity between inferred and trained vectors in doc2vec

I trained a paragraph vector model with gensim on a considerable amount of text data. Then I ran the following test: I looked up the trained vector for a sentence by its index, and also inferred a vector for the same sentence:

>>> x=m.docvecs[18638]
>>> g=m.infer_vector("The seven OxyR target sequences analyzed previously and two new sites grxA at position 207 in GenBank entry M13449 and a second Mu phage mom site at position 59 in GenBank entry V01463 were used to generate an individual information weight matrix".split())

When I computed the cosine similarity between the two vectors, it was very low, while I expected it to be high:

>>> 1 - spatial.distance.cosine(g, x)
0.20437437837633066

Can someone tell me if I'm doing something wrong, please?

Thanks

Upvotes: 1

Views: 676

Answers (2)

gojomo

Reputation: 54153

Some thoughts:

If your initial training applied any extra preprocessing to the text examples – say, case-flattening – you should apply that same preprocessing to the tokens you feed to infer_vector().
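For example, if training lower-cased the text before tokenizing (a hypothetical assumption here; `preprocess` is an illustrative helper, not a gensim function), inference should do exactly the same:

```python
def preprocess(text):
    # hypothetical training-time preprocessing: lower-case + whitespace split
    return text.lower().split()

sentence = ("The seven OxyR target sequences analyzed previously "
            "were used to generate an individual information weight matrix")
tokens = preprocess(sentence)
# tokens now look like ['the', 'seven', 'oxyr', 'target', ...],
# matching the form the model saw during training
# g = m.infer_vector(tokens)   # instead of sentence.split()
```

If the model was trained on lower-cased tokens, feeding it mixed-case tokens like 'OxyR' means those words are effectively unknown and contribute nothing to the inferred vector.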

The gensim defaults for the optional parameters of infer_vector(), including steps=5 and alpha=0.1, are wild guesses that may be insufficient for many models/training modes. Many have reported better results with much higher steps (into the hundreds), or a lower starting alpha (more like the training default of 0.025).

When the model itself returns most_similar() results, it does all of its cosine-similarity calculations on unit-length normalized doc-vectors – that is, those in the generated-when-needed model.docvecs.doctag_syn0norm array. However, the vector returned by infer_vector() will just be the raw, unnormalized inferred vector – analogous to the raw vectors in the model.docvecs.doctag_syn0 array. If computing your own cosine-similarities, be sure to account for this. (I think spatial.distance.cosine() accounts for this.)
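If you want to remove any doubt, you can unit-normalize both vectors yourself before taking the dot product; cosine similarity is then scale-invariant. A minimal NumPy sketch (the vector values are placeholders, not real doc-vectors):

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# placeholders standing in for m.docvecs[18638] and an infer_vector() result:
x = np.array([0.5, 1.0, -0.25])
g = np.array([1.0, 2.0, -0.5])   # same direction, different magnitude

cos_sim = float(np.dot(unit(x), unit(g)))
# cos_sim == 1.0 here, since g points in exactly the same direction as x
```

Computed this way, the result matches what scipy's 1 - spatial.distance.cosine(g, x) gives, since that function also normalizes internally.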

In general, re-inferring a vector for the same text that was used to train a doc-vector during bulk training should result in a very similar (but not identical) vector. So if in fact m.docvecs[18638] was for the exact same text as you're re-inferring here, the distance should be quite small. (This can be a good 'sanity check' on whether a training process and then later inference is having the desired effect.) If this expected similarity isn't achieved, you should re-check that the right preprocessing occurred during training, that the model parameters are causing real training to occur, that you're referring to the right trained vector (18638) without any off-by-one/etc errors, and so forth.

Upvotes: 0

Lenka Vraná

Reputation: 1706

The paragraph vector that was stored inside the model (m.docvecs[18638]) was created during the training phase, and the model may have changed afterwards as other paragraphs were used for training. With infer_vector(), you are using the final state of the model. You could try to minimize this difference by adding more epochs to the training phase.

However, I would recommend always using infer_vector(), so you can be sure that all your paragraph vectors were created with the same version of the model.

Upvotes: 1
