Lin Endian

Reputation: 103

doc2vec get most similar document

I am struggling to understand the usage of doc2vec. I trained a toy model on a set of documents using some sample code I saw on googling. Next I want to find the document that the model considers to be the closest match to documents in my training data. Say my document is "This is a sample document".

from nltk.tokenize import word_tokenize  # tokenizer used below

test_data = word_tokenize("This is a sample document".lower())
v = model.infer_vector(test_data)
print(v)
# prints a numpy array

# to find the most similar docs using a training tag
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
# prints [('0', 0.8838234543800354), ('1', 0.875300943851471),
#         ('3', 0.8752948641777039), ('2', 0.865660548210144)]

I have searched a fair bit, but I am confused about how to interpret similar_doc. I want to answer the question: "Which documents in my training data most closely match the document 'This is a sample document'?" How do I map the similar_doc output back to the training data? I do not understand the array of tuples: the second half of each tuple must be a probability, but what are '0', '1', etc.?

Upvotes: 2

Views: 3166

Answers (1)

gojomo

Reputation: 54153

When supplied with a doc-tag known from training, most_similar() will return a list of the 10 most similar document-tags, with their cosine-similarity scores. To then get the vectors, you'd look up the returned tags:

vector_for_1 = model.docvecs['1']

The model doesn't store the original texts; if you need to look them up, you'll need to remember your own association of tags-to-texts.
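For example, one way to keep that association is a plain dict keyed by tag. (The training texts below are hypothetical stand-ins for your own corpus; the tags mirror the integer-string tags seen in the question's output.)

```python
# Hypothetical training texts; tags '0', '1', ... assigned in order.
train_texts = [
    "This is a sample document",
    "Another training document",
    "Yet another example text",
    "One more short document",
]
tag_to_text = {str(i): text for i, text in enumerate(train_texts)}

# Given output shaped like model.docvecs.most_similar('1') ...
similar_doc = [('0', 0.8838), ('3', 0.8753), ('2', 0.8657)]

# ... map each tag back to the original text:
for tag, score in similar_doc:
    print(round(score, 4), tag_to_text[tag])
```

The dict is built once, at the same time you assign tags during training, so every tag returned by most_similar() can be resolved to its source text.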

Important notes:

  • Doc2Vec/Word2Vec don't work well with toy-sized examples: the useful relative positioning of final vectors needs lots of diverse examples. (You can sometimes squeeze out middling results from small datasets, as is done in some gensim test-cases and beginner demos, by using much-smaller vectors and many more training iterations – but even there, that code is using hundreds of texts each with hundreds of words.)

  • Be careful copying training code from random sites; many such examples are broken or out-of-date.

  • infer_vector() usually benefits from using a larger value of steps than the default of 5, especially for short texts. It also often works better with a non-default starting alpha, such as 0.025 (the same as the training default).

Upvotes: 1
