Reputation: 768
I am using the following code for LatentDirichletAllocation through Python's scikit-learn library:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
lda_model = LatentDirichletAllocation(n_components=10, max_iter=5,
                                      learning_method='online', learning_offset=50., random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_
When I print the shape of lda_H it returns (10, 236); I understand that 10 is the number of topics and 236 is the number of words. I wanted to see the effect of alpha on this, so I changed the above code to:
lda_model = LatentDirichletAllocation(n_components=10, doc_topic_prior=.01, max_iter=5,
                                      learning_method='online', learning_offset=50., random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_
However, I found that alpha has no effect on the words in the topics, and lda_H still returns (10, 236). I wonder why alpha does not change the words in the topics. I tried different values of alpha, but no change is observed in lda_H. Any comments on this are appreciated.
Upvotes: 0
Views: 395
Reputation: 1103
Alpha is a parameter that controls the shape of the per-document topic distributions; it does not influence the number of topics. The number of topics is not inferred, but fixed a priori by n_components.
Each document is always a mixture distribution over all topics, and alpha controls how that probability mass is spread over the topics for each document. We set it according to whether we a priori expect each document to be a relatively even mixture over all topics (larger alpha), or expect most of the probability to be allocated to a small set of topics per document (smaller alpha).
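As a standalone illustration of that effect (independent of the LDA fit itself, using only NumPy), drawing topic distributions directly from a Dirichlet shows how the concentration parameter shapes them:

import numpy as np

rng = np.random.default_rng(0)
# Small alpha: each draw puts most of its mass on a few topics.
print(rng.dirichlet([0.01] * 10, size=3).round(2))
# Large alpha: each draw is a much more even spread over the 10 topics.
print(rng.dirichlet([1.0] * 10, size=3).round(2))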
The changes with alpha should be reflected in the return value of the transform call, which you have assigned to lda_W. This gives the matrix of per-document topic distributions. It will still have the same shape, (n_samples, n_topics), but you should see changes in the average spread of probabilities within each row (document). You could measure this, for instance, by setting a threshold probability and counting the number of topics per document that exceed it, averaged across all documents, and comparing across the two values of alpha, as in the sketch below.
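A minimal sketch of that comparison, assuming tf is the document-term matrix fitted above; the threshold of 0.05 and the helper name avg_topics_above are just for illustration:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def avg_topics_above(doc_topic, threshold=0.05):
    # Count, per document (row), the topics whose probability exceeds the
    # threshold, then average that count over all documents.
    return (doc_topic > threshold).sum(axis=1).mean()

for alpha in (0.01, 1.0):
    lda = LatentDirichletAllocation(n_components=10, doc_topic_prior=alpha,
                                    max_iter=5, learning_method='online',
                                    learning_offset=50., random_state=0).fit(tf)
    lda_W = lda.transform(tf)
    print(alpha, avg_topics_above(lda_W))

A smaller doc_topic_prior should, on average, give fewer topics per document above the threshold.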
Each topic is likewise inferred as a mixture distribution over all words in the vocabulary, so the number of words (the second dimension of lda_H) won't change either; what can change is the probability allocated to each word within each topic.
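For example, a rough sketch of inspecting the per-topic word probabilities, assuming tf_vectorizer and lda_H from your code above:

import numpy as np

# Normalise each row of components_ so it sums to 1: per-topic word probabilities.
topic_word = lda_H / lda_H.sum(axis=1, keepdims=True)

# Vocabulary of 236 words; use get_feature_names() on older scikit-learn versions.
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, word_probs in enumerate(topic_word):
    top = np.argsort(word_probs)[::-1][:10]  # indices of the 10 most probable words
    print(topic_idx, [feature_names[i] for i in top])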
It's well worth giving the original paper on LDA a read for a more in-depth explanation of what the algorithm is doing.
Upvotes: 1