Reputation: 31
I'm trying to replicate Mikolov's work combining PV-DM and PV-DBOW. He says that both algorithms should be used together to get better results. For this reason I'm training the models and then giving the document tags to t-SNE.
Using Gensim's Doc2Vec I can get the document tags with `docvecs.vectors_docs`, but the concatenated structure doesn't appear to have the document tags of the joint model. It is still treating the models as separate entities.
(This I can see from the variable explorer)
I'm also using the `ConcatenatedDoc2Vec` from gensim.
Can anyone help me? Is there a way I can get the document tags from the concatenated new entity and not the individual ones?
Upvotes: 3
Views: 1566
Reputation: 54243
Be warned that many have tried to reproduce the reported 'Paragraph Vector' results using concatenated PV-DBOW and PV-DM+dm_concat vectors without success. (For example, Mikolov himself reports being unable to reproduce the exact numbers that he says co-author Le contributed to the paper.)
The `ConcatenatedDoc2Vec` class is just a thin wrapper to join two models you've already trained separately, for the purposes of vector-lookup-by-tag (`__getitem__()` indexed access) and combined inference. (It's a mere 10 lines of code.)
To make this post-training join sensible, those two models should have been trained with the exact same documents/tags in the exact same order.
So if you need a list of tags, ask either model separately.
If you need some other combination of the two models, such as a single large array including all concatenated vectors, you'd have to construct that yourself, perhaps using `numpy`'s `hstack` function.
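As a rough sketch of that do-it-yourself join: the helper name `concatenate_doc_vectors` and the stand-in arrays below are hypothetical, used only for illustration. With real models you would pass each model's own tag list and its document-vector array (for example the `docvecs.vectors_docs` array mentioned in the question; the exact attribute name depends on your gensim version).

```python
import numpy as np


def concatenate_doc_vectors(tags_a, vectors_a, tags_b, vectors_b):
    """Join per-document vectors from two models into one wide array.

    Hypothetical helper: refuses to join unless both models list the
    same tags in the same order, since row i of each array must refer
    to the same document.
    """
    if list(tags_a) != list(tags_b):
        raise ValueError("models must share the same tags in the same order")
    # hstack places the columns of vectors_b to the right of vectors_a,
    # so each row becomes [vector from model A | vector from model B].
    return np.hstack([vectors_a, vectors_b])


# Stand-in data (assumption): 3 documents, a 4-dimensional PV-DBOW
# vector and a 5-dimensional PV-DM vector for each.
tags = ["doc_0", "doc_1", "doc_2"]
dbow_vectors = np.arange(12, dtype=np.float32).reshape(3, 4)
dm_vectors = -np.arange(15, dtype=np.float32).reshape(3, 5)

combined = concatenate_doc_vectors(tags, dbow_vectors, tags, dm_vectors)
print(combined.shape)  # (3, 9)
```

The resulting array has one row per document and can be handed to t-SNE as-is; row `i` corresponds to `tags[i]`.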
You can see my notebook trying to reproduce some of the paper's results inside the gensim `docs/notebooks` directory, or view it online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
Upvotes: 2