Animesh Pandey
Animesh Pandey

Reputation: 6018

Applying LDA to a corpus for training using gensim

I have corpus with around 20,000 documents and I have to train that data set for topic modelling using LDA.

import logging, gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
id2word = gensim.corpora.Dictionary('questions.dict')
mm = gensim.corpora.MmCorpus('questions.mm')
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, chunksize=3000, passes=20)
lda.print_topics(20)

Whenever I run this program I come across this error:

2013-04-28 09:57:09,750 : INFO : adding document #0 to Dictionary(0 unique tokens)
2013-04-28 09:57:09,759 : INFO : built Dictionary(11 unique tokens) from 14 documents (total 14 corpus positions)
2013-04-28 09:57:09,785 : INFO : loaded corpus index from questions.mm.index
2013-04-28 09:57:09,790 : INFO : initializing corpus reader from questions.mm
2013-04-28 09:57:09,796 : INFO : accepted corpus with 19188 documents, 15791 features, 106222 non-zero entries
2013-04-28 09:57:09,802 : INFO : using serial LDA version on this node
2013-04-28 09:57:09,808 : INFO : running batch LDA training, 100 topics, 20 passes over the supplied corpus of 19188 documents, updating model once every 19188 documents
2013-04-28 09:57:10,267 : INFO : PROGRESS: iteration 0, at document #3000/19188

Traceback (most recent call last):
File "C:/Users/Animesh/Desktop/NLP/topicmodel/lda.py", line 10, in <module>
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, chunksize=3000, passes=20)
File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 265, in __init__
self.update(corpus)
File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 445, in update
self.do_estep(chunk, other)
File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 365, in do_estep
gamma, sstats = self.inference(chunk, collect_sstats=True)
File "C:\Python27\lib\site-packages\gensim-0.8.6-py2.7.egg\gensim\models\ldamodel.py", line 318, in inference
expElogbetad = self.expElogbeta[:, ids]
IndexError: index (11) out of range (0<=index<10) in dimension 1

I even tried to change the values in LdaModel function but I always get the same error!

What should be done ?

Upvotes: 2

Views: 1875

Answers (1)

benlaird
benlaird

Reputation: 879

It seems that your dictionary (id2word) is not correctly matched up with your corpus object (mm).

For whatever reason, id2word (the mapping of word tokens to wordids) only contains 11 tokens 2013-04-28 09:57:09,759 : INFO : built Dictionary(11 unique tokens) from 14 documents (total 14 corpus positions)

Your corpus contains 15791 features, so when it looks for a feature with id > 10, it fails. ids in expElogbetad = self.expElogbeta[:, ids] is a list of all the word ids in a particular document.

I'd rerun the creation of the corpus and dictionary:

$ python -m gensim.scripts.make_wiki (from the gensim LDA tutorial).

The logging data for the created dictionary should indicate far more than 11 tokens I believe. I've run into a similar problem to this myself.

Upvotes: 2

Related Questions