Gensim Doc2Vec getting the doc tags from the Concatenated model

I'm trying to replicate Mikolov's work on PV-DM + PV-DBOW. He says that both algorithms should be used together to get better results, so I'm trying to train the models and then pass the document tags to t-SNE. With Gensim's Doc2Vec I can get the document tags from docvecs.vectors_docs, but the concatenated structure doesn't appear to expose the document tags of the joint model. It still treats the two models as separate entities (I can see this in the variable explorer).

I'm also using the ConcatenatedDoc2Vec from gensim.

Can anyone help me? Is there a way to get the document tags from the new concatenated entity rather than from the individual models?

Upvotes: 3

Views: 1566

Answers (1)

gojomo

Reputation: 54243

Be warned that many have tried to reproduce the reported 'Paragraph Vector' results using concatenated PV-DBOW and PV-DM+dm_concat vectors without success. (For example, Mikolov himself reports being unable to reproduce the exact numbers that he says co-author Le contributed to the paper.)

The ConcatenatedDoc2Vec class is just a thin wrapper to join two models you've already trained separately, for the purposes of vector-lookup-by-tag (__getitem__() indexed access) and combined inference. (It's a mere 10 lines of code.)
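To illustrate what such a thin wrapper amounts to, here is a simplified sketch of the idea, not gensim's actual source. `ToyModel` and `JoinedModels` are hypothetical names invented for this example, and the toy models are plain dictionaries standing in for two separately trained Doc2Vec models.

```python
import numpy as np


class ToyModel:
    """Hypothetical stand-in for a trained model: looks up a vector by tag."""

    def __init__(self, vectors_by_tag):
        self.vectors_by_tag = vectors_by_tag

    def __getitem__(self, tag):
        return self.vectors_by_tag[tag]


class JoinedModels:
    """Hypothetical wrapper: a tag lookup concatenates each model's vector."""

    def __init__(self, models):
        self.models = models

    def __getitem__(self, tag):
        # Ask every wrapped model for the same tag and join the results end to end.
        return np.concatenate([model[tag] for model in self.models])


# Two toy "models" that know the same tags (assumed 2-d and 3-d vectors).
dbow = ToyModel({"doc_0": np.array([1.0, 2.0]), "doc_1": np.array([3.0, 4.0])})
dm = ToyModel({"doc_0": np.array([5.0, 6.0, 7.0]), "doc_1": np.array([8.0, 9.0, 10.0])})

joined = JoinedModels([dbow, dm])
joined_vec = joined["doc_0"]  # 2 + 3 = 5 dimensions
```

Note that the wrapper keeps no tag list or vector array of its own. It only forwards lookups, which is why the two models still show up as separate objects.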

To make this post-training join sensible, those two models should have been trained with the exact same documents/tags in the exact same order.

So if you need a list of tags, ask either model separately.

If you need some other combination of the two models – such as a single large array including all concatenated vectors – you'd have to construct that yourself, perhaps using numpy's hstack function.
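A minimal sketch of that kind of manual join might look like the following. The arrays here are made-up toy data. With real models you would take each model's per-document vector array instead (the question mentions `docvecs.vectors_docs`; the exact attribute name depends on your gensim version), and this only makes sense if both models saw the same tags in the same order.

```python
import numpy as np

# Toy stand-ins for the document-vector arrays of two separately trained models.
# Row i of each array is assumed to belong to the same document tag.
tags = ["doc_0", "doc_1", "doc_2"]
dbow_vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
dm_vectors = np.array([[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]])

# Both arrays must have one row per document before they can be joined side by side.
if dbow_vectors.shape[0] != dm_vectors.shape[0]:
    raise ValueError("models disagree on the number of documents")

# Join column-wise: each row becomes [dbow vector | dm vector].
combined = np.hstack([dbow_vectors, dm_vectors])

# Map each tag to its row so a combined vector can be looked up by tag.
index_of = {tag: i for i, tag in enumerate(tags)}
doc_1_vector = combined[index_of["doc_1"]]
```

The `combined` array has one row per document and can be handed to something like t-SNE directly, with `tags` giving the label for each row.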

You can see my notebook trying to reproduce some of the paper's results inside the gensim docs/notebooks directory, or view online at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb

Upvotes: 2
