Reputation: 61
I'm trying to use gensim's lda model. If I create the lda model with a given corpus and then want to update it with a new corpus containing words that weren't seen in the first corpus, how do I do this? When I just call lda_model.update(new_corpus), I get the following error:
/Library/Python/2.7/site-packages/gensim/models/ldamodel.pyc in inference(self, chunk, collect_sstats)
361 Elogthetad = Elogtheta[d, :]
362 expElogthetad = expElogtheta[d, :]
-->363 expElogbetad = self.expElogbeta[:, ids]
364
365 # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
IndexError: index 57 is out of bounds for axis 1 with size 57
I initialized lda_model with a corpus consisting of only 57 words, which is why the bound in the error is 57. I then called update on it with a corpus containing many more words, and that failed.
How do I get around this? Is it possible to update my lda model with a new corpus that contains new words?
Upvotes: 4
Views: 2888
Reputation: 4266
No, you must use the same dictionary (the mapping between words and their integer ids) for training, updates, and inference.
This means you can update the model with new documents, but not with new word types.
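To make the constraint concrete, here is a minimal plain-Python sketch of a frozen word-to-id mapping. It does not call gensim; build_vocab and to_bow are hypothetical helper names used only for illustration. Words missing from the original mapping are dropped, which, as far as I know, is also what gensim's Dictionary.doc2bow does by default, so every id stays inside the range the model was trained with.

```python
from collections import Counter

def build_vocab(docs):
    """Assign a fixed integer id to every word type seen in the initial corpus."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a tokenized document to sorted (id, count) pairs.

    Words that are not in the frozen vocabulary are silently dropped.
    """
    counts = Counter(vocab[w] for w in doc if w in vocab)
    return sorted(counts.items())

initial_docs = [["cat", "dog"], ["dog", "bird"]]
vocab = build_vocab(initial_docs)   # frozen once the model has been trained

new_doc = ["dog", "dog", "fish"]    # "fish" never appeared in the initial corpus
bow = to_bow(new_doc, vocab)        # only "dog" survives
```

A new corpus converted this way can be passed to update without the IndexError, at the cost of ignoring the new words entirely.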
Check out the HashDictionary class, which uses the "hashing trick" to work around this limitation (though the hashing trick comes with its own caveats).
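The idea behind the hashing trick can be sketched in plain Python. This is an illustration of the technique, not gensim's implementation: hashed_id, to_hashed_bow, and the ID_RANGE value are assumptions made for the example. Every word, seen or unseen, is hashed into a fixed range of ids, so the model's vocabulary size never has to grow.

```python
import zlib
from collections import Counter

ID_RANGE = 32000  # fixed number of ids, chosen up front (illustrative value)

def hashed_id(word, id_range=ID_RANGE):
    """Map any word to an id in [0, id_range) without storing a vocabulary."""
    return zlib.adler32(word.encode("utf-8")) % id_range

def to_hashed_bow(doc, id_range=ID_RANGE):
    """Convert a tokenized document to sorted (id, count) pairs using hashed ids."""
    counts = Counter(hashed_id(w, id_range) for w in doc)
    return sorted(counts.items())

# "fish" gets a valid id even though no dictionary has ever seen it.
bow = to_hashed_bow(["dog", "dog", "fish"])
```

The caveats follow from the same design: two different words can collide on the same id, the model has to be sized for the whole id range from the start, and an id can no longer be mapped back to a single word with certainty.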
Upvotes: 3