Reputation: 187
When I train Doc2Vec (using Gensim's Doc2Vec in Python) on a corpus of about 10k documents (each a few hundred words long) and then infer document vectors for the same documents, they are not at all similar to the trained document vectors. I would expect them to be at least somewhat similar.
That is, I compare model.docvecs['some_doc_id'] with model.infer_vector(documents['some_doc_id']).
Cosine distances between trained and inferred vectors for the first few documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so poor even for documents used in training, I can't even begin to try inferring vectors for unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4,
seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
Side note: the documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inferring).
I also played around with the parameters a bit, and the results were similar. So if your suggestion is something like "try increasing or decreasing this or that training parameter", I've most likely tried it already. Maybe I just didn't come across the 'correct' parameters, though.
Thanks for any suggestions as to what I can do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec); it doesn't have to be this one, if you know of another that can achieve better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence
files = {}
folder = 'some path'  # each file contains a few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
    files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()

documents = []
for k, v in files.items():
    words = text_to_word_sequence(v, lower=True)  # converts string to list of words, removes commas etc.
    documents.append(TaggedDocument(tags=[k], words=words))

d2 = Doc2Vec(size=200, documents=documents)

for doc in documents:
    trained = d2.docvecs[doc.tags[0]]
    inferred = d2.infer_vector(doc.words, steps=50)
    print(cosine(trained, inferred))  # cosine distance (1 - similarity) from scipy
Upvotes: 3
Views: 2159
Reputation: 54153
What is the type of your documents object, and are you sure that it is a multiply-iterable object, so that the model can do all of its 16 passes (one vocabulary-scan pass plus your 15 training iterations) over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with all items as TaggedDocument-shaped objects with the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can be iterated over only once, and then to ignore any logged hints/warnings that no real training has happened. The inference/similarity results from such a model will be essentially random.)
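For illustration, here is a minimal sketch of the difference (using toy stand-in data, not your corpus): a generator is consumed after a single pass, while a list, or any object whose __iter__() returns a fresh iterator each time, can be re-iterated for every pass the model makes.
from gensim.models.doc2vec import TaggedDocument

texts = {'doc_a': 'some tokenized text', 'doc_b': 'some other text'}  # toy stand-in data

# A generator expression is exhausted after one pass, so any later
# training passes would silently see an empty corpus:
docs_once = (TaggedDocument(words=v.split(), tags=[k]) for k, v in texts.items())

# A plain list can be iterated as many times as the model needs:
docs_many = [TaggedDocument(words=v.split(), tags=[k]) for k, v in texts.items()]

print(iter(docs_once) is docs_once)  # True: iter() hands back the same one-shot iterator
print(iter(docs_many) is docs_many)  # False: each iter() call returns a fresh iterator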
Then, for infer_vector(), does documents[tag] really return just the list-of-words it expects (not a TaggedDocument or a string)? (Users often supply strings, rather than lists-of-tokens, as the training or inference words and get results that are just noise.)
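A minimal sketch of that pitfall, assuming model is an already-trained Doc2Vec instance:
text = 'the quick brown fox jumps over the lazy dog'

# Wrong: a plain string is iterated character by character, so the model
# mostly sees unknown one-letter "words" and the result is essentially noise.
noisy_vector = model.infer_vector(text)

# Right: pass the same kind of list-of-tokens that was used during training.
useful_vector = model.infer_vector(text.split())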
Was there an evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to a common configuration. (And then, even after the basics are working, limit changes to those that can be demonstrated as better via a repeatable scoring process.)
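As a sketch only (in the same pre-4.0 gensim API as your code, with no claim that these exact values are best), a closer-to-defaults starting point might look like:
from gensim.models import Doc2Vec

# Leave alpha, window, min_count, sample, and hs/negative at their class defaults
# and pin down only the handful of choices you actually care about.
model = Doc2Vec(documents=documents, size=100, iter=15, workers=4)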
Some report needing much higher steps values – 100 or more – to get better inference results, though that would be most crucial for very small documents (of a handful to a couple dozen words) rather than the few-hundred-word documents you describe.
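A quick way to probe this, reusing d2, doc, trained and cosine from the example in your question:
# Re-infer the same document with progressively more steps and watch whether
# the result moves closer to (lower cosine distance from) the trained vector.
for steps in (20, 50, 100, 200):
    inferred = d2.infer_vector(doc.words, steps=steps)
    print(steps, cosine(trained, inferred))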
A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector size (100) and larger number of iterations (15), it might be workable.
If you're still having problems, you should expand your question with more code showing how documents works, some suggestive example documents, and your cosine-similarity evaluation process – to see if there are any oversights at each of those steps.
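One common sanity check of that kind, sketched against the d2 model and documents list from your question: re-infer each training document and see whether its own trained vector is among its nearest neighbours.
hits = 0
for doc in documents:
    inferred = d2.infer_vector(doc.words, steps=50)
    # Rank all trained doc-vectors by similarity to the freshly inferred vector.
    nearest = d2.docvecs.most_similar([inferred], topn=3)
    if doc.tags[0] in [tag for tag, _ in nearest]:
        hits += 1
print('%d of %d documents rank themselves among their own top 3' % (hits, len(documents)))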
Upvotes: 2