Reputation: 33
I tried the three default options for alpha in gensim's LDA implementation and now wonder about the result: the sum of topic probabilities over all documents is smaller than the number of documents in the corpus (see below). For example, alpha = 'symmetric' yields about 9357 as the sum of topic probabilities, although the number of documents is 9459. Could someone tell me the reason for this unexpected result?
alpha = symmetric
nr_of_docs = 9459
sum_of_topic_probs = 9357.12285605
alpha = asymmetric
nr_of_docs = 9459
sum_of_topic_probs = 9375.29253851
alpha = auto
nr_of_docs = 9459
sum_of_topic_probs = 9396.40123459
Upvotes: 2
Views: 6204
Reputation: 21
I think the problem is that, by default, minimum_probability is set to 0.01, not 0.00.
You can check out the LDA model code here:
Therefore, if you train your model with the default settings, summing the probabilities across topics for a given document might not give exactly 1.00.
Since minimum_probability is passed in here, you can always reset it after training like this:
your_lda_model_name.minimum_probability = 0.0
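To see why the cutoff makes the per-document sums fall short of 1, here is a small plain-Python illustration (the numbers are made up; no gensim needed):

```python
# A document's full topic distribution sums to 1.0, but with the
# default settings only topics whose probability is >= 0.01 are
# reported, and the remainder is not renormalized.
dist = [0.62, 0.30, 0.05, 0.008, 0.007, 0.006, 0.005, 0.004]
minimum_probability = 0.01
reported = [p for p in dist if p >= minimum_probability]
print(sum(dist))      # ~1.0  (full distribution)
print(sum(reported))  # ~0.97 (the low-probability tail is dropped)
```

Summed over thousands of documents, these small per-document shortfalls add up, which would explain a gap like the one between 9357 and 9459. You can also pass minimum_probability=0.0 to LdaModel when you construct it instead of resetting it afterwards.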
Upvotes: 2
Reputation: 707
I tried to replicate your problem but in my case (using a very small corpus), I could not find any difference between the three sums.
I will still share the paths I tried in case anybody else wants to replicate the problem ;-)
I use a small example from gensim's website and train three different LDA models:
from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda_sym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1,
                                   chunksize=100000, passes=1, alpha='symmetric')
lda_asym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1,
                                    chunksize=100000, passes=1, alpha='asymmetric')
lda_auto = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1,
                                    chunksize=100000, passes=1, alpha='auto')
Now I sum the topic probabilities over all documents (9 documents in total):
import pandas as pd

counts = {}
for model in [lda_sym, lda_asym, lda_auto]:
    s = 0
    for doc_n in range(len(corpus)):
        # sum of this document's topic probabilities for the current model
        doc_sum = pd.DataFrame(model[corpus[doc_n]])[1].sum()
        if doc_sum < 1:
            print('Sum smaller than 1 for')
            print(model, doc_n)
        s += doc_sum
    counts[model] = s
And indeed the sums are always 9:
counts = {<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3908>: 9.0,
<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3048>: 9.0,
<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3b70>: 9.0}
Of course that's not a representative example since it's so small. So if you could, maybe provide some more details about your corpus.
In general I would assume that the sum should always equal the number of documents. My first intuition was that empty documents might change the sum, but that is not the case either: empty documents simply yield a topic distribution identical to alpha (which makes sense):
pd.DataFrame(lda_asym[[]])[1]
returns
0 0.203498
1 0.154607
2 0.124657
3 0.104428
4 0.089848
5 0.078840
6 0.070235
7 0.063324
8 0.057651
9 0.052911
which is identical to
lda_asym.alpha
array([ 0.20349777, 0.1546068 , 0.12465746, 0.10442834, 0.08984802,
0.0788403 , 0.07023542, 0.06332404, 0.057651 , 0.05291085])
which also sums to 1.
From a theoretical point of view, choosing different alphas will yield completely different LDA models.
Alpha is the hyperparameter of the Dirichlet prior, the distribution from which we draw theta, and theta in turn is the parameter that determines the shape of each document's topic distribution. So essentially, alpha influences how topic distributions are drawn. That is why choosing different alphas will also give you slightly different results for
lda.show_topics()
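The role of alpha can be illustrated without gensim: draws from a Dirichlet prior are themselves probability vectors, and the alpha values control their typical shape. A sketch using numpy (the asymmetric alpha values here are hypothetical, just chosen to favor earlier topics):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric alpha: no topic is favored a priori.
theta_sym = rng.dirichlet([0.1] * 10)
# Decreasing (asymmetric) alpha: earlier topics are favored a priori.
theta_asym = rng.dirichlet([1.0 / k for k in range(1, 11)])

# Every draw is a valid topic distribution: non-negative, sums to 1.
print(theta_sym.sum(), theta_asym.sum())
```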
But I do not see why the sum of topic probabilities for a document should differ from 1, for any LDA model or any kind of document.
Upvotes: 3