Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors

Question

I'm training a Word2Vec model like:

from gensim.models import Word2Vec
model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)

and a Doc2Vec model like:

from gensim.models import Doc2Vec
doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

with the same data and comparable parameters.

I then use these models for a classification task. I found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80, and 150), which made no difference.
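To make the baseline concrete, here is a minimal sketch of what averaging or summing the word vectors of a document means. It is not my exact code: the helper name pool_word_vectors and the toy_vectors dict are made up for illustration, and the plain dict stands in for the word vectors of the trained Word2Vec model.

```python
def pool_word_vectors(tokens, word_vectors, dim, mode="mean"):
    """Combine the vectors of a document's tokens into one document vector.

    word_vectors is any mapping from token to a list of floats
    (an assumption for this sketch; a trained model would supply these).
    """
    # Tokens without a vector are skipped instead of raising a KeyError.
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    if not found:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    # Element-wise sum across all token vectors.
    summed = [sum(column) for column in zip(*found)]
    if mode == "sum":
        return summed
    # Default: element-wise mean over the tokens that had a vector.
    return [value / len(found) for value in summed]


# Toy 2-dimensional vectors, purely illustrative.
toy_vectors = {"a": [1.0, 0.0], "b": [3.0, 2.0]}
mean_vec = pool_word_vectors(["a", "b", "zzz"], toy_vectors, dim=2)
sum_vec = pool_word_vectors(["a", "b", "zzz"], toy_vectors, dim=2, mode="sum")
```

The resulting fixed-length vector is what gets passed to the classifier as the document's features, in place of the doc2vec vector.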

Any tips or ideas on why this happens and how to improve the doc2vec results?

Update: This is how doc2vec_tagged_documents is created:

from gensim.models.doc2vec import TaggedDocument
# Tag each document with its integer position in the corpus.
doc2vec_tagged_documents = [
    TaggedDocument(document, [index]) for index, document in enumerate(documents)
]

Some more facts about my data:

My training data contains 4000 documents with 900 words on average.
My vocabulary size is about 1000 words.
The documents in my classification task are much shorter (12 words on average). I also tried splitting the training data into lines and training the doc2vec model on those, but the result was almost the same.
My data is not natural language; please keep this in mind.
