random

Reputation: 33

Gensim LDA alpha-parameter

I tried the three default options for alpha in gensim's LDA implementation and now wonder about the result: the sum of topic probabilities over all documents is smaller than the number of documents in the corpus (see below). For example, alpha = 'symmetric' yields about 9357 as the sum of topic probabilities, although the number of documents is 9459. Can someone tell me the reason for this unexpected result?

alpha = symmetric
nr_of_docs = 9459
sum_of_topic_probs = 9357.12285605

alpha = asymmetric
nr_of_docs = 9459
sum_of_topic_probs = 9375.29253851

alpha = auto
nr_of_docs = 9459
sum_of_topic_probs = 9396.40123459

Upvotes: 2

Views: 6204

Answers (2)

Ybing

Reputation: 21

I think the problem is that, by default, minimum_probability is set to 0.01, not 0.00.

You can check out the LDA model code here:

Therefore, if you train your model with the default setting, the probabilities across topics for a given document may not sum to 1.00, because topics below the threshold are dropped from the output.

Since minimum_probability is stored on the model, you can always reset it with something like this:

your_lda_model_name.minimum_probability = 0.0
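To see the effect without gensim, here is a toy sketch of what the filtering does (the distribution below is made up, and `filtered_sum` is just an illustration, not gensim's actual code):

```python
# Hypothetical per-document topic distribution (sums to 1.0).
full_dist = [0.62, 0.25, 0.09, 0.02, 0.008, 0.007, 0.005]

def filtered_sum(dist, minimum_probability):
    """Sum only the topics at or above the threshold,
    mimicking how output filtering drops small probabilities."""
    return sum(p for p in dist if p >= minimum_probability)

print(filtered_sum(full_dist, 0.01))  # three topics below 1% are dropped, so the sum is < 1
print(filtered_sum(full_dist, 0.0))   # nothing dropped, the full sum is recovered
```

Summed over thousands of documents, these small per-document losses add up to exactly the kind of deficit you are seeing.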

Upvotes: 2

Jérôme Bau

Reputation: 707

I tried to replicate your problem but in my case (using a very small corpus), I could not find any difference between the three sums.
I will still share the paths I tried in case anybody else wants to replicate the problem ;-)

I use a small example from gensim's website and train three different LDA models:

from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

lda_sym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                                   update_every=1, chunksize=100000, passes=1, alpha='symmetric')
lda_asym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                                    update_every=1, chunksize=100000, passes=1, alpha='asymmetric')
lda_auto = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                                    update_every=1, chunksize=100000, passes=1, alpha='auto')

Now I sum over the topic probabilities for all documents (9 documents in total)

import pandas as pd

counts = {}
for model in [lda_sym, lda_asym, lda_auto]:
    s = 0
    for doc_n in range(len(corpus)):
        # Sum the topic probabilities for this document under the current model
        doc_sum = pd.DataFrame(model[corpus[doc_n]])[1].sum()
        if doc_sum < 1:
            print('Sum smaller than 1 for')
            print(model, doc_n)
        s += doc_sum
    counts[model] = s

And indeed the sums are always 9:

counts = {<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3908>: 9.0,
          <gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3048>: 9.0,
          <gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3b70>: 9.0}

Of course that's not a representative example since it's so small. So if you could, maybe provide some more details about your corpus.

In general I would assume that this should always be the case. My first intuition was that maybe empty documents would change the sum, but that is also not the case, since empty documents just yield a topic distribution identical to alpha (which makes sense):

pd.DataFrame(lda_asym[[]])[1]

returns

0    0.203498
1    0.154607
2    0.124657
3    0.104428
4    0.089848
5    0.078840
6    0.070235
7    0.063324
8    0.057651
9    0.052911

which is identical to

lda_asym.alpha

array([ 0.20349777,  0.1546068 ,  0.12465746,  0.10442834,  0.08984802,
    0.0788403 ,  0.07023542,  0.06332404,  0.057651  ,  0.05291085])

which also sums to 1.

From a theoretical point of view, choosing different alphas yields completely different LDA models.

Alpha is the hyperparameter of the Dirichlet prior. The Dirichlet prior is the distribution from which we draw theta, and theta is the per-document topic distribution. So essentially, alpha influences how we draw topic distributions. That is why choosing different alphas will also give you slightly different results for

lda.show_topics()
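The role of alpha can be sketched with NumPy's Dirichlet sampler. Note that the asymmetric alpha vector below is chosen purely for illustration; it is not gensim's exact asymmetric prior:

```python
import numpy as np

rng = np.random.default_rng(0)
num_topics = 10

# Symmetric alpha: every topic equally likely a priori.
sym_alpha = np.full(num_topics, 1.0 / num_topics)
# Asymmetric alpha: earlier topics favoured (shape chosen for illustration only).
asym_alpha = 1.0 / (np.arange(num_topics) + 1)

# theta is one per-document topic distribution drawn from the prior.
theta_sym = rng.dirichlet(sym_alpha)
theta_asym = rng.dirichlet(asym_alpha)

# Each draw is itself a probability distribution: non-negative and summing to 1,
# regardless of which alpha it was drawn from.
print(theta_sym.sum(), theta_asym.sum())
```

Whatever alpha you pick only changes *which* topic distributions are likely, never whether they sum to 1.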

But I do not see why the sum of topic probabilities for a document should differ from 1, for any LDA model or any kind of document.

Upvotes: 3
