aburkov

Reputation: 13

How to get most similar words to a document in gensim doc2vec?

I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model.

For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.

Now I want to find the most relevant words to "doc_about_java".

I followed the solution from the closed question "How to find most similar terms/words of a document in doc2vec?", but it gives me seemingly random words; the word "java" is not even among the first 100 similar words:

docvec = doc2vec.docvecs['doc_about_java']
print(doc2vec.most_similar(positive=[docvec], topn=100))

I also tried this:

print(doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"]))

but it didn't change anything. How can I find the most similar words to a given document?

Upvotes: 0

Views: 3090

Answers (1)

gojomo

Reputation: 54243

Not all Doc2Vec modes even train word-vectors. In particular, PV-DBOW mode (dm=0), which often works very well for doc-vector comparisons, leaves word-vectors at their randomly-initialized (and unused) positions.

So that may explain why the results of your initial attempt to get a list-of-related-words seem random.

To get word-vectors, you'd need to use PV-DM mode (dm=1), or add optional concurrent word-vector training to PV-DBOW (dm=0, dbow_words=1).
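As a rough sketch of those two options, the settings could be kept in a small table and passed to the Doc2Vec constructor. The names WORD_VECTOR_MODES and build_model are made up for this example, and it assumes gensim is installed and that tagged_docs is a list of TaggedDocument objects:

```python
# Sketch only: WORD_VECTOR_MODES and build_model are hypothetical names.
# Both settings below make Doc2Vec train word-vectors alongside doc-vectors.
WORD_VECTOR_MODES = {
    "pv-dm": {"dm": 1},                            # PV-DM
    "pv-dbow+words": {"dm": 0, "dbow_words": 1},   # PV-DBOW plus word training
}


def build_model(tagged_docs, mode="pv-dm", **extra):
    """Train a Doc2Vec model in a mode that also trains word-vectors.

    tagged_docs is assumed to be a list of gensim TaggedDocument objects.
    extra is passed through as additional Doc2Vec keyword arguments
    (for example min_count).
    """
    # Imported here so the settings table can be inspected without gensim.
    from gensim.models.doc2vec import Doc2Vec

    params = dict(WORD_VECTOR_MODES[mode])
    params.update(extra)
    # Passing the corpus to the constructor builds the vocabulary and trains.
    return Doc2Vec(documents=tagged_docs, **params)
```

With plain dm=0 and no dbow_words=1, the same call would still produce usable doc-vectors, but the word-vectors would stay at their initial random values.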

(If this isn't the issue, there may be other problems in your training setup, so you should show more detail about your data source, size, and code.)

(Separately, your alternate attempt, by using doc2vec["doc_about_java"], retrieves a word-vector for the token "doc_about_java" (which may not be present at all). To get the doc-vector, use doc2vec.docvecs["doc_about_java"], as in your first code block.)
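To make the lookup distinction concrete, here is a minimal sketch of a helper that fetches the doc-vector by tag and then searches the word-vectors with it. The function name words_near_doc is made up for this example; it assumes a trained model whose word-vectors were actually trained, and it falls back between docvecs (the accessor used in the question) and dv (the name used in newer gensim releases):

```python
def words_near_doc(model, tag, topn=10):
    """Return the topn words whose word-vectors lie closest to the doc-vector for tag."""
    # Doc-vectors live in their own collection, separate from the word-vectors.
    # Older gensim releases expose it as docvecs, newer ones as dv.
    doc_vectors = model.dv if hasattr(model, "dv") else model.docvecs
    # Look up the doc-vector by tag, then rank words by similarity to it.
    return model.wv.similar_by_vector(doc_vectors[tag], topn=topn)
```

It would be called as words_near_doc(doc2vec, "doc_about_java", topn=100). The results are only meaningful if word-vectors and doc-vectors were trained together in the same model.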

Upvotes: 2
