Reputation: 2082
I am training a gensim doc2vec model on a text file, 'full_texts.txt', that contains ~1600 documents. Once the model is trained, I want to use its similarity methods on words and sentences.
However, since this is my first time using gensim, I have not been able to find a solution. When I look up similar words as shown below, I get an error saying that the word doesn't exist in the vocabulary.
My other question is: how do I check similarity for entire documents? I have read a lot of questions about it, like this one, and looked at the documentation, but I am still not sure what I am doing wrong.
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument
from gensim.models.doc2vec import TaggedDocument
tagdocs = TaggedLineDocument('full_texts.txt')
d2v_mod = Doc2Vec(min_count=3,vector_size = 200, workers = 2, window = 5, epochs = 30,dm=0,dbow_words=1,seed=42)
d2v_mod.build_vocab(tagdocs)
d2v_mod.train(tagdocs,total_examples=d2v_mod.corpus_count,epochs=20)
d2v_mod.wv.similar_by_word('overdraft',topn=10)
KeyError: "word 'overdraft' not in vocabulary"
Upvotes: 2
Views: 1747
Reputation: 54153
Are you sure 'overdraft' appears at least min_count=3 times in your corpus? (For example, what does grep -c " overdraft " full_texts.txt return?)
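The same check can be done in Python. The sketch below assumes the corpus has one document per line and that tokens are separated only by whitespace, which is how TaggedLineDocument reads a file: it does no lowercasing or punctuation stripping, so 'Overdraft' and 'overdraft.' count as different tokens from 'overdraft'. The helper names count_tokens and words_below_min_count are hypothetical and not part of gensim.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens across an iterable of lines.

    Each line is treated as one document and split on whitespace only,
    with no lowercasing and no punctuation removal.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def words_below_min_count(counts, words, min_count=3):
    """Return {word: count} for the words that occur fewer than min_count times."""
    return {w: counts[w] for w in words if counts[w] < min_count}


# Example use against the corpus file from the question:
#
#     with open('full_texts.txt', encoding='utf-8') as f:
#         counts = count_tokens(f)
#     print(counts['overdraft'])
#     print(words_below_min_count(counts, ['overdraft'], min_count=3))
```

Unlike the grep command, which counts matching lines and misses the word at the start or end of a line, this counts every exact-token occurrence, which is closer to what min_count is compared against.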
(Note also that 1600 docs is a very small corpus for Doc2Vec purposes; published work typically uses at least tens of thousands of docs, and often millions.)
In general, if you are concerned about getting basic functionality working, good ideas are to:
- follow trustworthy examples: the gensim docs/notebooks directory includes several Jupyter/IPython notebooks demonstrating doc2vec functionality, including the minimal intro doc2vec-lee.ipynb, also viewable online (but it's best to run it locally so you can tinker with specifics to learn)
- enable logging at the INFO level, and watch the output closely to make sure the various reported progress steps, including counts of words/docs and training durations, indicate everything is working sensibly
- probe the resulting model for expected behavior. For example, is an expected word present in the learned vocabulary? 'overdrafts' in d2v_mod.wv. How many document tags were learned? len(d2v_mod.docvecs) (d2v_mod.dv in gensim 4.0 and later). And so on.
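The logging and probing suggestions above could be sketched roughly as follows. The logging setup uses only the standard library; probe_model is a hypothetical helper, not a gensim function, and it assumes the model exposes a wv attribute that supports `in` plus either dv (gensim 4.0 and later) or docvecs (earlier releases) for the document vectors.

```python
import logging

# Configure INFO-level logging before building the vocabulary and training,
# so that gensim's progress messages (word/doc counts, timings) are printed.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def probe_model(model, words):
    """Report which words were kept in the vocabulary and how many doc tags exist.

    `model` is assumed to be a trained Doc2Vec-like object.
    """
    # Prefer the newer attribute name, fall back to the older one.
    doc_vectors = getattr(model, "dv", None)
    if doc_vectors is None:
        doc_vectors = model.docvecs
    return {
        "words_present": {w: (w in model.wv) for w in words},
        "num_doc_tags": len(doc_vectors),
    }


# Example use with the model from the question:
#
#     print(probe_model(d2v_mod, ['overdraft', 'overdrafts']))
```

If a word you expect is reported as absent, it most likely occurred fewer than min_count times, or only in a differently cased or punctuated form.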
Upvotes: 3