user3121136

Reputation: 61

gensim lda model - calling update on a corpus with unseen words

I'm trying to use gensim's LDA model. If I create the model from a given corpus and then want to update it with a new corpus containing words that did not appear in the first one, how do I do this? When I simply call lda_model.update(new_corpus), I get the following error:

/Library/Python/2.7/site-packages/gensim/models/ldamodel.pyc in inference(self, chunk, collect_sstats)
    361             Elogthetad = Elogtheta[d, :]
    362             expElogthetad = expElogtheta[d, :]
 -->363             expElogbetad = self.expElogbeta[:, ids]
    364 
    365             # The optimal phi_{dwk} is proportional to expElogthetad_k * expElogbetad_w.
   IndexError: index 57 is out of bounds for axis 1 with size 57

I initialized lda_model with a corpus containing only 57 distinct words, which is why the bound is 57. I then called update on it with a corpus containing many more words, and this fails.

How do I get around this? I want to be able to update my LDA model with a new corpus that contains new words. Is this possible?

Upvotes: 4

Views: 2888

Answers (1)

Radim

Reputation: 4266

No, you must use the same dictionary (the mapping between words and their integer ids) for training, updates and inference.

This means you can update the model with new documents, but not with new word types.

Check out the HashDictionary class which uses the "hashing trick" to work around this limitation (but the hashing trick comes with its own caveats).
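To illustrate the idea behind the hashing trick, here is a rough standard-library sketch. It is not gensim's implementation, and the helper names (hashed_id, hashed_doc2bow) and the id range of 1000 are assumptions made for this example. Each token is hashed straight to an id in a fixed range, so words never seen before still get a valid id:

```python
import zlib
from collections import Counter

ID_RANGE = 1000  # fixed vocabulary size chosen up front (hypothetical value)


def hashed_id(token, id_range=ID_RANGE):
    """Map any token to an integer id in [0, id_range) without a lookup table."""
    return zlib.adler32(token.encode("utf-8")) % id_range


def hashed_doc2bow(tokens, id_range=ID_RANGE):
    """Return a sorted bag-of-words list of (id, count) pairs."""
    counts = Counter(hashed_id(token, id_range) for token in tokens)
    return sorted(counts.items())


# Words from the "first" corpus and never-before-seen words are handled
# identically: both end up with ids inside the fixed range.
seen = hashed_doc2bow(["cat", "dog", "cat"])
unseen = hashed_doc2bow(["zebra", "lion"])
```

The main caveat is collisions: two different words can hash to the same id, in which case the model cannot tell them apart. A larger id range makes collisions rarer at the cost of a larger model.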

Upvotes: 3
