Reputation: 832
I trained a paragraph vector (Doc2Vec) model with gensim on a considerable amount of text data. Then I ran the following test: I looked up the stored vector of a sentence by its index, and also inferred a vector for the very same sentence:
>>> x=m.docvecs[18638]
>>> g=m.infer_vector("The seven OxyR target sequences analyzed previously and two new sites grxA at position 207 in GenBank entry M13449 and a second Mu phage mom site at position 59 in GenBank entry V01463 were used to generate an individual information weight matrix".split())
When I computed the cosine similarity between the two vectors, it was very low, while I expected it to be high:
>>> 1 - spatial.distance.cosine(g, x)
0.20437437837633066
Can someone tell me if I'm doing something wrong, please?
Thanks
Upvotes: 1
Views: 676
Reputation: 54153
Some thoughts:
If your initial training did any extra preprocessing on text examples – like, say, case-flattening – you should apply the same preprocessing to the tokens fed to infer_vector().
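To illustrate, here is a hypothetical sketch: the preprocess() helper and the model name m are assumptions for illustration, not gensim API – the point is only that the same function must run over both training texts and inference texts.

```python
# Hypothetical illustration: if training lower-cased and whitespace-
# tokenized the corpus, inference must do exactly the same.
def preprocess(text):
    """Apply the same steps that were used on the training corpus."""
    return text.lower().split()

tokens = preprocess("The seven OxyR target sequences analyzed previously")
print(tokens[:3])  # ['the', 'seven', 'oxyr']

# g = m.infer_vector(tokens)   # m: the trained Doc2Vec model (assumed)
```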
The gensim defaults for the optional parameters of infer_vector(), including steps=5 and alpha=0.1, are wild guesses that may be insufficient for many models/training modes. Many have reported better results with much higher steps (into the hundreds), or a lower starting alpha (more like the training default of 0.025).
When the model itself returns most_similar() results, it does all of its cosine-similarity calculations on unit-length-normalized doc-vectors – that is, those in the generated-when-needed model.docvecs.doctag_syn0norm array. However, the vector returned by infer_vector() will just be the raw, unnormalized inferred vector – analogous to the raw vectors in the model.docvecs.doctag_syn0 array. If computing your own cosine-similarities, be sure to account for this. (spatial.distance.cosine() divides out the vector magnitudes, so it already accounts for this.)
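The normalization point can be sketched in plain Python: cosine similarity divides out vector lengths, so two vectors pointing the same way score the same regardless of magnitude, which is what scipy's cosine distance computes (as 1 minus this value).

```python
import math

def unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_similarity(u, v):
    """Dot product of the unit-normalized vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

raw = [3.0, 4.0]      # e.g. an unnormalized inferred vector
scaled = [0.6, 0.8]   # same direction, already unit length
print(cosine_similarity(raw, scaled))  # ~= 1.0: magnitude is ignored
```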
In general, re-inferring a vector for the same text as was used to train a doc-vector during bulk training should result in a very similar (but not identical) vector. So if in fact m.docvecs[18638] corresponds to the exact same text you're re-inferring here, the distance should be quite small. (This can be a good 'sanity check' on whether a training process and later inference are having the desired effect.) If this expected similarity isn't achieved, you should re-check that the right preprocessing occurred during training, that the model parameters are causing real training to occur, that you're referring to the right trained vector (18638) without any off-by-one errors, and so forth.
Upvotes: 0
Reputation: 1706
The paragraph vector stored inside the model (m.docvecs[18638]) was created during the training phase, and the model may have changed afterwards as other paragraphs were used for training. With infer_vector(), you are using the final state of the model. You could try to minimize this difference by adding more epochs to the training phase.
However, I would recommend always using infer_vector(), so you can be sure that all your paragraph vectors were created with the same version of the model.
Upvotes: 1