Reputation: 538
Suppose I build an LDA topic model using gensim or sklearn and assign a top topic to each document, but some documents don't match the topic assigned to them. Besides trying different numbers of topics, or using the coherence score to pick the optimal number of topics, what other tricks can I use to improve my model?
Upvotes: 0
Views: 2670
Reputation: 3005
A sample Python implementation for @rchurch4's answer:
We can try different numbers of topics and different values of alpha and beta (eta) to increase the coherence score; a higher coherence score generally indicates a better model.
import gensim
from gensim.models import CoherenceModel

# corpus, id2word, and tokenized_articles (the tokenized documents used to
# build the dictionary) are assumed to exist already.
def calculate_coherence_score(n, alpha, beta):
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=n,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha=alpha,
                                                eta=beta,
                                                per_word_topics=True)
    coherence_model_lda = CoherenceModel(model=lda_model,
                                         texts=tokenized_articles,
                                         dictionary=id2word,
                                         coherence='c_v')
    return coherence_model_lda.get_coherence()
# lists of hyperparameter values to grid-search over
no_of_topics = [2, 5, 7, 10, 12, 14]
alpha_list = ['symmetric', 0.3, 0.5, 0.7]
beta_list = ['auto', 0.3, 0.5, 0.7]

for n in no_of_topics:
    for alpha in alpha_list:
        for beta in beta_list:
            coherence_score = calculate_coherence_score(n, alpha, beta)
            print(f"n : {n} ; alpha : {alpha} ; beta : {beta} ; Score : {coherence_score}")
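Rather than reading the best combination off the printout, you can track the maximum while looping. A minimal sketch, with a dummy stand-in for calculate_coherence_score so it runs on its own (the real function above needs the corpus and dictionary):

```python
from itertools import product

no_of_topics = [2, 5, 7, 10, 12, 14]
alpha_list = ['symmetric', 0.3, 0.5, 0.7]
beta_list = ['auto', 0.3, 0.5, 0.7]

# Hypothetical stand-in: pretends fewer topics always score higher.
# Replace with the real calculate_coherence_score in practice.
def calculate_coherence_score(n, alpha, beta):
    return 1.0 / n

best_score, best_params = float('-inf'), None
for n, alpha, beta in product(no_of_topics, alpha_list, beta_list):
    score = calculate_coherence_score(n, alpha, beta)
    if score > best_score:
        best_score, best_params = score, (n, alpha, beta)

print(best_params, best_score)
```

Keep in mind that LDA training is stochastic, so for a real comparison you may want to average the coherence over a few random seeds per configuration.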
Upvotes: 0
Reputation: 899
LDA also (semi-secretly) takes the parameters alpha and beta. Think of alpha as the parameter that tells LDA how many topics each document should be generated from, and beta as the parameter that tells LDA how many topics each word should appear in. You can play with these and you may get better results.
However, LDA is an unsupervised model, and even the perfect settings for k, alpha, and beta will result in some incorrectly assigned documents. If your data isn't preprocessed well, it almost doesn't matter what you assign the parameters: the model will always produce poor results.
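Since preprocessing matters so much, here is a minimal illustrative cleanup pass (hypothetical helper, not from the answer): lowercase, keep letter-only tokens, drop stopwords and very short tokens. Real pipelines usually also lemmatize (e.g. with spaCy or NLTK) and detect bigrams.

```python
import re

# Tiny illustrative stopword set; use a full list (e.g. NLTK's) in practice.
STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it'}

def preprocess(doc):
    # lowercase, extract alphabetic tokens, filter stopwords and short tokens
    tokens = re.findall(r'[a-z]+', doc.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = ["The model assigns a topic to each document in the corpus."]
print([preprocess(d) for d in docs])
```

Feeding the model cleaner tokens like these typically does more for topic quality than any amount of hyperparameter tuning.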
Upvotes: 0