Lin Endian

Reputation: 103

doc2vec get most similar document

I am struggling to understand the usage of doc2vec. I trained a toy model on a set of documents using some sample code I saw on googling. Next I want to find the document that the model considers to be the closest match to documents in my training data. Say my document is "This is a sample document".

from nltk.tokenize import word_tokenize  # tokenizer used below

test_data = word_tokenize("This is a sample document".lower())
v = model.infer_vector(test_data)
print(v)
# prints a numpy array

# to find the most similar docs using a training tag
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
# prints [('0', 0.8838234543800354), ('1', 0.875300943851471),
#         ('3', 0.8752948641777039), ('2', 0.865660548210144)]

I have searched a fair bit, but I am confused about how to interpret similar_doc. I want to answer the question: "Which documents in my training data most closely match the document 'This is a sample document'?" How do I map the similar_doc output back to the training data? I do not understand the array of tuples: the second half of each tuple must be a probability, but what are '0', '1', etc.?

Upvotes: 2

Views: 3166

Answers (1)

gojomo

Reputation: 54153

When supplied with a doc-tag known from training, most_similar() will return a list of the 10 most similar document-tags, with their cosine-similarity scores. To then get the vectors, you'd look up the returned tags:

vector_for_1 = model.docvecs['1']

The model doesn't store the original texts; if you need to look them up, you'll need to remember your own association of tags-to-texts.
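For example, one way to keep that association is a plain dict keyed by tag. (The training texts below are hypothetical stand-ins for your own corpus; the tags mirror the integer-string tags seen in the question's output.)

```python
# Hypothetical training texts; tags '0', '1', ... assigned in order.
train_texts = [
    "This is a sample document",
    "Another training document",
    "Yet another example text",
    "One more short document",
]
tag_to_text = {str(i): text for i, text in enumerate(train_texts)}

# Given output shaped like model.docvecs.most_similar('1') ...
similar_doc = [('0', 0.8838), ('3', 0.8753), ('2', 0.8657)]

# ... map each tag back to the original text:
for tag, score in similar_doc:
    print(round(score, 4), tag_to_text[tag])
```

The dict is built once, at the same time you assign tags during training, so every tag returned by most_similar() can be resolved to its source text.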

Important notes:

  • Doc2Vec/Word2Vec don't work well with toy-sized examples: the useful relative positioning of final vectors needs lots of diverse examples. (You can sometimes squeeze out middling results from small datasets, as is done in some gensim test-cases and beginner demos, by using much-smaller vectors and many more training iterations – but even there, that code is using hundreds of texts each with hundreds of words.)

  • Be careful copying training code from random sites; many such examples are broken or out-of-date.

  • infer_vector() usually benefits from using a larger value of steps than the default of 5, especially for short texts. It also often works better with a non-default starting alpha, such as 0.025 (the same as the training default).

Upvotes: 1
