Reputation: 13
I have built a gensim Doc2vec model. Let's call it doc2vec. Now I want to find the most relevant words to a given document according to my doc2vec model.
For example, I have a document about "java" with the tag "doc_about_java". When I ask for similar documents, I get documents about other programming languages and topics related to java. So my document model works well.
Now I want to find the most relevant words to "doc_about_java".
I follow the solution from the closed question How to find most similar terms/words of a document in doc2vec? and it gives me seemingly random words, the word "java" is not even among the first 100 similar words:
docvec = doc2vec.docvecs['doc_about_java']
print doc2vec.most_similar(positive=[docvec], topn=100)
I also tried like this:
print doc2vec.wv.similar_by_vector(doc2vec["doc_about_java"])
but it didn't change anything. How can I find the most similar words to a given document?
Upvotes: 0
Views: 3090
Reputation: 54243
Not all Doc2Vec
modes even train word-vectors. In particular, the PV-DBOW mode dm=0
, which often works very well for doc-vector comparisons, leaves word-vectors at randomly-assigned (and unused) positions.
So that may explain why the results of your initial attempt to get a list-of-related-words seem random.
To get word-vectors, you'd need to use PV-DM mode (dm=1
), or add optional concurrent word-vector training to PV-DBOW (dm=0, dbow_words=1
).
(If this isn't the issue, there maybe other problems in your training setup, so you should show more detail about your data source, size, and code.)
(Separately, your alternate attempt code-line, by using doc2vec["doc_about_java"]
is retrieving a word-vector for "doc_about_java"
(which may not be present at all). To get the doc-vector, use doc2vec.docvecs["doc_about_java"]
, as in your first code block.)
Upvotes: 2