Reputation: 538
Suppose I build an LDA topic model using gensim or sklearn and assign a top topic to each document, but some documents don't match the topic assigned to them. Besides trying different numbers of topics, or using the coherence score to pick the optimal number of topics, what other tricks can I use to improve my model?
Upvotes: 0
Views: 2670
Reputation: 3005
A sample Python implementation for @rchurch4's answer:
We can try different numbers of topics and different values of alpha and beta (eta) to increase the coherence score; a higher coherence score generally indicates a better model.
import gensim
from gensim.models import CoherenceModel

# corpus, id2word, and tokenized_articles (the tokenized documents used to
# build the dictionary) are assumed to exist already.
def calculate_coherence_score(n, alpha, beta):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=n,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha=alpha,
                                                eta=beta,
                                                per_word_topics=True)
    coherence_model_lda = CoherenceModel(model=lda_model,
                                         texts=tokenized_articles,
                                         dictionary=id2word,
                                         coherence='c_v')
    return coherence_model_lda.get_coherence()
# lists of hyperparameter values to grid-search over
no_of_topics = [2, 5, 7, 10, 12, 14]
alpha_list = ['symmetric', 0.3, 0.5, 0.7]
beta_list = ['auto', 0.3, 0.5, 0.7]

for n in no_of_topics:
    for alpha in alpha_list:
        for beta in beta_list:
            coherence_score = calculate_coherence_score(n, alpha, beta)
            print(f"n : {n} ; alpha : {alpha} ; beta : {beta} ; Score : {coherence_score}")
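Rather than reading the best combination off the printout, you can track the maximum while looping. A minimal sketch, with a dummy stand-in for calculate_coherence_score so it runs on its own (the real function above needs the corpus and dictionary):

```python
from itertools import product

no_of_topics = [2, 5, 7, 10, 12, 14]
alpha_list = ['symmetric', 0.3, 0.5, 0.7]
beta_list = ['auto', 0.3, 0.5, 0.7]

# Hypothetical stand-in: pretends fewer topics always score higher.
# Replace with the real calculate_coherence_score in practice.
def calculate_coherence_score(n, alpha, beta):
    return 1.0 / n

best_score, best_params = float('-inf'), None
for n, alpha, beta in product(no_of_topics, alpha_list, beta_list):
    score = calculate_coherence_score(n, alpha, beta)
    if score > best_score:
        best_score, best_params = score, (n, alpha, beta)

print(best_params, best_score)
```

Keep in mind that LDA training is stochastic, so for a real comparison you may want to average the coherence over a few random seeds per configuration.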
Upvotes: 0
Reputation: 899
LDA also (semi-secretly) takes the parameters alpha and beta. Think of alpha as the parameter that tells LDA how many topics each document should be generated from, and beta as the parameter that tells LDA how many topics each word should appear in. You can play with these and you may get better results.
However, LDA is an unsupervised model, and even the perfect settings for k, alpha, and beta will result in some incorrectly assigned documents. If your data isn't preprocessed well, it almost doesn't matter what you assign the parameters: the model will always produce poor results.
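Since preprocessing matters so much, here is a minimal illustrative cleanup pass (hypothetical helper, not from the answer): lowercase, keep letter-only tokens, drop stopwords and very short tokens. Real pipelines usually also lemmatize (e.g. with spaCy or NLTK) and detect bigrams.

```python
import re

# Tiny illustrative stopword set; use a full list (e.g. NLTK's) in practice.
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it'}

def preprocess(doc):
    # lowercase, extract alphabetic tokens, filter stopwords and short tokens
    tokens = re.findall(r'[a-z]+', doc.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = ["The model assigns a topic to each document in the corpus."]
print([preprocess(d) for d in docs])
```

Feeding the model cleaner tokens like these typically does more for topic quality than any amount of hyperparameter tuning.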
Upvotes: 0