Stefan Falk

Reputation: 25387

How to load a pre-trained model in gensim and train doc2vec with it?

I have a ready-to-go word2vec model that I already trained. I have serialized it as a CSV file:

word,  v0,     v1,     ..., vN
house, 0.1234, 0.4567, ..., 0.3461
car,   0.456,  0.677,  ..., 0.3461

What I'd like to know is how I can load that word vector model in gensim and use that to train a paragraph or doc2vec model.

This Doc2Vec tutorial says I can load a model in the form of a "# C text format" file, but I have no idea what that actually means. What is "C text format" in the first place? But more importantly:

How do I build the vocabulary from my word2vec model?

Upvotes: 1

Views: 2762

Answers (1)

gojomo

Reputation: 54173

Doc2Vec does not need word-vectors as an input: it will create any word-vectors that are needed during its own training. (And some modes, like pure DBOW – dm=0, dbow_words=0 – don't use or train word-vectors at all.)

Seeding a Doc2Vec model with word-vectors might help or hurt; there is little theory and few published results to offer guidance. There's an experimental method on Word2Vec, intersect_word2vec_format(), that can merge word2vec-c-format vectors into a model with an existing vocabulary, but you'd need to review the source to really understand its assumptions:

https://github.com/RaRe-Technologies/gensim/blob/51753b95415bbc344ea6af671818277464905ea2/gensim/models/word2vec.py#L1140
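
As for what "C text format" means: it is the plain-text layout written by the original word2vec C tool. The first line holds the vocabulary size and the vector dimension, and every following line holds one word followed by its space-separated vector values. A minimal sketch of converting the CSV from the question into that layout, using only the standard library, could look like the following. The helper name csv_to_word2vec_text is hypothetical, and the sketch assumes a single header row and words that contain no spaces:

```python
import csv
import io


def csv_to_word2vec_text(csv_file, out_file):
    """Convert rows of `word, v0, ..., vN` into word2vec C text format.

    Hypothetical helper: assumes one header row and words without spaces.
    Returns (vocabulary size, vector dimension).
    """
    reader = csv.reader(csv_file, skipinitialspace=True)
    next(reader)  # skip the header row: word, v0, ..., vN
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        raise ValueError("no vectors found in CSV input")

    dim = len(rows[0]) - 1  # every column except the word itself
    # Header line of the C text format: "<vocab_size> <vector_size>"
    out_file.write(f"{len(rows)} {dim}\n")
    for row in rows:
        if len(row) - 1 != dim:
            raise ValueError(f"inconsistent vector length for {row[0]!r}")
        # One line per word: the word, then its values, space-separated
        out_file.write(" ".join(row) + "\n")
    return len(rows), dim


# Small in-memory demo with made-up three-dimensional vectors
sample = (
    "word,  v0,     v1,     v2\n"
    "house, 0.1234, 0.4567, 0.3461\n"
    "car,   0.456,  0.677,  0.3461\n"
)
out = io.StringIO()
count, dim = csv_to_word2vec_text(io.StringIO(sample), out)
converted = out.getvalue()
```

With real files, you would pass open file objects instead of the StringIO buffers. The resulting file should be readable with gensim's KeyedVectors.load_word2vec_format(path, binary=False), and it is the kind of input intersect_word2vec_format() expects, though where that method lives and what arguments it takes have varied between gensim versions, so check the version you have installed.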

Upvotes: 1
