Reputation: 103
I am struggling to understand the usage of doc2vec. I trained a toy model on a set of documents using some sample code I found by googling. Next, I want to find which documents in my training data the model considers the closest match to a new document. Say my new document is "This is a sample document".
from nltk.tokenize import word_tokenize

test_data = word_tokenize("This is a sample document".lower())
v = model.infer_vector(test_data)
print(v)
# prints a numpy array.
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
# prints [('0', 0.8838234543800354), ('1', 0.875300943851471),
#         ('3', 0.8752948641777039), ('2', 0.865660548210144)]
I have searched a fair bit, but I am confused about how to interpret similar_doc. I want to answer the question "which documents in my training data most closely match the document 'This is a sample document'?", so how do I map the similar_doc output back to the training data? I don't understand the list of tuples: the second element of each tuple looks like a probability, but what are '0', '1', etc.?
Upvotes: 2
Views: 3166
Reputation: 54153
When supplied with a doc-tag known from training, most_similar() returns a list of the 10 most-similar document-tags, each with its cosine-similarity score. To then get the vectors, you'd look up the returned tags:
vector_for_1 = model.docvecs['1']
The model doesn't store the original texts; if you need to look them up, you'll need to remember your own association of tags-to-texts.
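For example, here's a rough sketch of that pattern (not your exact code: it assumes the pre-4.0 gensim API you're using, an illustrative training_texts list, and toy parameters). The tags are just the stringified list indexes, so they map straight back to your own texts:
from nltk.tokenize import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# keep your own tag-to-text mapping; here each tag is just the list index
training_texts = [
    "the first training document",
    "a second training document about something else",
    "yet another training document",
    "one more short training document",
]
tagged = [TaggedDocument(word_tokenize(text.lower()), [str(i)])
          for i, text in enumerate(training_texts)]
model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=100)

# infer a vector for the new text, then ask for the nearest training tags
# (in gensim 4.0+ use model.dv instead of model.docvecs)
new_vec = model.infer_vector(word_tokenize("this is a sample document".lower()))
for tag, sim in model.docvecs.most_similar([new_vec], topn=3):
    print(sim, tag, training_texts[int(tag)])  # map the tag back to your text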
Important notes:
Doc2Vec/Word2Vec don't work well with toy-sized examples: the useful relative positioning of final vectors needs lots of diverse examples. (You can sometimes squeeze out middling results from small datasets, as is done in some gensim test-cases and beginner demos, by using much-smaller vectors and many more training iterations – but even there, that code is using hundreds of texts each with hundreds of words.)
Be careful copying training code from random sites; many such examples are broken or out-of-date.
infer_vector() usually benefits from a larger value of steps than the default of 5, especially for short texts. It also often works better with a non-default starting alpha, such as 0.025 (the same as the training default); see the sketch below.
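For example (the parameter names shown are the pre-4.0 ones; in gensim 4.0+ steps became epochs and docvecs became dv, and 50 is just an illustrative value):
# more inference passes and an explicit starting alpha tend to give more
# stable vectors for short texts
v = model.infer_vector(test_data, alpha=0.025, steps=50)
similar_doc = model.docvecs.most_similar([v], topn=10)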
Upvotes: 1