Max

Reputation: 185

Gensim LDA : error cannot compute LDA over an empty collection (no terms)

I get the same error as in this thread: "ValueError: cannot compute LDA over an empty collection (no terms)", but the solution given there doesn't apply to my case.

I'm working on a notebook with Sklearn, and I've done an LDA and a NMF.

I'm now trying to do the same using Gensim: https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.htm

Here is the Python code from my notebook that shows what I'm trying to do:

dic = gensim.corpora.Dictionary(texts_lem)
dic.filter_extremes(no_below=10, no_above=0.8)
corpus = [dic.doc2bow(doc) for doc in texts_lem]

model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dic.id2token,
    num_topics=10,
)

I'm reusing the existing texts_lem list from another section of my notebook for the Gensim LDA. I'm following the guide: create a dictionary, filter extremes, build a corpus, and pass it to LdaModel().

Unfortunately, it doesn't work, and commenting out the filter_extremes line doesn't help (that was the answer in the other thread with the same error).

texts_lem is a list of lists of words, like the following:

[
 ['word', 'word', 'word', 'word'],
 ['word', 'word', 'word', 'word'],
 ['word', 'word', 'word', 'word'],
]

My error is :

ValueError: cannot compute LDA over an empty collection (no terms)

Many thanks for your help.

Upvotes: 0

Views: 1306

Answers (2)

waijean

Reputation: 61

As shown in the gensim LDA tutorial, you need to "load" the dictionary before passing dictionary.id2token to LdaModel. Using your example, the code should be:

dic = gensim.corpora.Dictionary(texts_lem)
dic.filter_extremes(no_below=10, no_above=0.8)
corpus = [dic.doc2bow(doc) for doc in texts_lem]   

# Make an index-to-word dictionary.
temp = dic[0]  # This is only to "load" the dictionary.
id2word = dic.id2token

model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=10,
)

This is because id2token is initialized lazily to save memory: it stays empty until an item is first looked up on the dictionary, which is what dic[0] triggers. See the gensim Dictionary documentation for details.
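To make the lazy behaviour concrete, here is a minimal sketch in plain Python. LazyDictionary is a hypothetical toy class written for this illustration; it is not gensim's implementation and only mimics the pattern described above, where the reverse mapping stays empty until the first lookup by id.

```python
class LazyDictionary:
    """Toy stand-in for a token dictionary (hypothetical, NOT gensim's code)."""

    def __init__(self, documents):
        self.token2id = {}
        self.id2token = {}  # reverse mapping, deliberately left empty for now
        for doc in documents:
            for token in doc:
                # Assign the next free integer id to each unseen token.
                self.token2id.setdefault(token, len(self.token2id))

    def __len__(self):
        return len(self.token2id)

    def __getitem__(self, token_id):
        # Build the reverse mapping on first lookup, or when it is stale.
        if len(self.id2token) != len(self.token2id):
            self.id2token = {i: t for t, i in self.token2id.items()}
        return self.id2token[token_id]


docs = [["apple", "banana"], ["banana", "cherry"]]
dic = LazyDictionary(docs)

size_before = len(dic.id2token)  # reverse mapping not built yet
_ = dic[0]                       # any lookup by id triggers the build
size_after = len(dic.id2token)

print(size_before, size_after)
```

If the real id2token behaves the same way, handing it to a model before any lookup means handing over an empty mapping, which would be consistent with the "no terms" error in the question.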

Upvotes: 2

Max

Reputation: 185

Just don't use id2token; pass the dictionary itself as id2word.

Your model should be:

model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dic,
    num_topics=10,
)

Works fine. Who knows what's going on?

Upvotes: 0
