Reputation: 5278
I'm training a Word2Vec model like:
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)
and a Doc2Vec model like:
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)
with the same data and comparable parameters.
After this I'm using these models for my classification task. And I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried with many more doc2vec iterations (25, 80 and 150 - makes no difference).
Any tips or ideas why, and how to improve the doc2vec results?
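For concreteness, the averaging I mean looks roughly like this (a minimal sketch with toy vectors; average_vector is just an illustrative helper of mine, not a gensim API, and the real vectors come from the trained Word2Vec model):

```python
import numpy as np

def average_vector(tokens, vectors, size=200):
    """Average the word vectors for the tokens present in `vectors`.

    `vectors` is any mapping from word -> np.ndarray (e.g. a trained
    Word2Vec model's keyed vectors); unknown words are skipped.
    """
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return np.zeros(size)  # empty doc or all-unknown tokens
    return np.mean(rows, axis=0)

# toy stand-in for trained embeddings
vecs = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
doc_vec = average_vector(["cat", "dog", "unseen"], vecs, size=2)
```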
Update: This is how doc2vec_tagged_documents is created:
doc2vec_tagged_documents = list()
counter = 0
for document in documents:
    doc2vec_tagged_documents.append(TaggedDocument(document, [counter]))
    counter += 1
Some more facts about my data:
doc2vec model like this, but it's almost the same result.

Upvotes: 9
Views: 5521
Reputation: 54153
Summing/averaging word2vec vectors is often quite good!
It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)
If your main interest is the doc-vectors – and not the word-vectors that are in some Doc2Vec modes co-trained – definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top-performer.
If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning – unless you have a larger collection of longer docs.
Other things that sometimes help improve Doc2Vec vectors for classification purposes:
- re-inferring all document vectors at the end of training, perhaps even using parameters different from the infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than what's left over from bulk training
- where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags
- rare words are essentially just noise to Word2Vec or Doc2Vec - so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW, and increasing min_count.)
Hope this helps.
Upvotes: 22