ScientiaEtVeritas

Reputation: 5278

Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors

I'm training a Word2Vec model like:

model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)

and a Doc2Vec model like:

doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

with the same data and comparable parameters.

I then use both models for a classification task, and I found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150), but it makes no difference.

Any tips or ideas on why this happens and how to improve the doc2vec results?
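For reference, the averaging baseline mentioned above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the exact code used here: the toy `word_vectors` dict is a stand-in for the trained word vectors, and `mean_vector` is a hypothetical helper name.

```python
def mean_vector(tokens, word_vectors):
    """Average the vectors of all tokens that have an embedding.

    Returns None when no token of the document is in the vocabulary.
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    # Component-wise mean over the known word vectors.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embeddings (assumption: stands in for a trained model).
word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
doc_vec = mean_vector(["good", "movie", "unseen"], word_vectors)
```

Out-of-vocabulary tokens such as "unseen" are simply skipped. Summing instead of averaging only drops the division by `len(known)`.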

Update: This is how doc2vec_tagged_documents is created:

doc2vec_tagged_documents = [
    TaggedDocument(document, [counter])
    for counter, document in enumerate(documents)
]

Some more facts about my data:

Upvotes: 9

Views: 5521

Answers (1)

gojomo

Reputation: 54153

Summing/averaging word2vec vectors is often quite good!

It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)

If your main interest is the doc-vectors, and not the word-vectors that some Doc2Vec modes co-train, definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top performer.

If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by the word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning, unless you have a larger collection of longer docs.
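A quick way to check whether this applies to a corpus is to compare window with the average document length. The following is only a rough diagnostic sketch; `window_to_doc_ratio` is a hypothetical helper, and the toy `documents` list is an assumption standing in for your tokenized corpus.

```python
def window_to_doc_ratio(documents, window):
    """Return (average document length in tokens, window / average length).

    Assumes `documents` is a non-empty list of non-empty token lists.
    """
    lengths = [len(doc) for doc in documents]
    avg_len = sum(lengths) / len(lengths)
    return avg_len, window / avg_len


# Toy corpus (assumption): two short tokenized documents.
documents = [
    ["very", "short", "doc", "here"],
    ["another", "short", "doc", "with", "six", "tokens"],
]
avg_len, ratio = window_to_doc_ratio(documents, window=5)
```

A ratio near or above 1 means a single context window covers most of a typical document, which is the situation described above where doc-vectors and word-vectors end up learning much the same thing.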

Other things that sometimes help improve Doc2Vec vectors for classification purposes:

  • re-inferring all document vectors, at the end of training, perhaps even using parameters different from infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than what's left-over from bulk training

  • where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags

  • rare words are essentially just noise to Word2Vec or Doc2Vec - so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is also, in competition with the doc-vector, trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)

Hope this helps.

Upvotes: 22
