Chelsea Wang

Reputation: 629

how to determine the number of topics for LDA?

I am new to LDA and I want to use it in my work. However, I have run into some problems.

In order to get the best performance, I want to estimate the best number of topics. After reading "Finding Scientific Topics" (Griffiths & Steyvers, 2004), I know that I can first calculate log P(w|z) and then use the harmonic mean of a series of P(w|z) values to estimate P(w|T).
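For reference, my understanding of the estimator in that paper is roughly the following (please correct me if I am misreading it):

```latex
% Harmonic-mean estimator, as I understand it from Griffiths & Steyvers (2004):
% z^{(1)}, ..., z^{(M)} are the "series" of M topic assignments the paper refers to.
\[
  P(\mathbf{w} \mid T) \;\approx\;
  \left( \frac{1}{M} \sum_{m=1}^{M} \frac{1}{P\!\left(\mathbf{w} \mid \mathbf{z}^{(m)}\right)} \right)^{-1}
\]
```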

My question is: what does "a series of" mean here?

Upvotes: 31

Views: 43809

Answers (5)

Karsten W.

Reputation: 18440

Since I am working on that same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method: first you train a word2vec model (e.g. using the word2vec package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust package), and finally you use the number of clusters found as the number of topics in the LDA algorithm.
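In case it is useful, here is a rough Python sketch of that pipeline with gensim. It is not the paper's exact method: I substitute scikit-learn's MeanShift (which also determines the number of clusters on its own) for the density-peak clustering step, since densityClust is an R package, and `tokenized_docs` is assumed to be your own preprocessed corpus.

```python
# Sketch (not the paper's exact method): estimate k from word-embedding clusters.
# Assumes `tokenized_docs` is a list of token lists, e.g. [["topic", "model", ...], ...].
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import MeanShift

# 1. Train word embeddings on the corpus.
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5,
               min_count=5, workers=4, seed=0)
word_vectors = w2v.wv.vectors  # one row per vocabulary word

# 2. Cluster the word vectors. The paper uses density-peak clustering
#    (densityClust in R); MeanShift is only a stand-in here that also infers
#    the number of clusters by itself (and can be slow on large vocabularies).
labels = MeanShift().fit(word_vectors).labels_
k = len(np.unique(labels))

# 3. Use the number of clusters as the number of LDA topics.
print("suggested number of topics:", k)
```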

If time permits, I will try this out. I also wonder whether the word2vec model can make LDA obsolete.

[Flowchart from the paper]

Upvotes: 0

Ashok Lathwal

Reputation: 349

Let k = number of topics

There is no single best way, and I am not even sure there are any standard practices for this.

Method 1: Try out different values of k and select the one that gives the largest likelihood (see the sketch after this list).

Method 2: Instead of LDA, see if you can use HDP-LDA

Method 3: If HDP-LDA is infeasible on your corpus (because of corpus size), take a uniform sample of your corpus, run HDP-LDA on that, and take the value of k it gives. For a small interval around this k, use Method 1.
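A minimal sketch of Method 1 in Python with gensim, assuming `corpus` (bag-of-words) and `dictionary` come from your own preprocessing and the range of k values is arbitrary:

```python
# Sketch of Method 1: fit LDA for several values of k and compare likelihoods.
# Assumes `corpus` and `dictionary` were built beforehand, e.g. with gensim.corpora.Dictionary.
from gensim.models import LdaModel

best_k, best_bound = None, float("-inf")
for k in range(5, 51, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=0)
    # Variational bound on the per-word log likelihood (higher is better).
    # Ideally evaluate on held-out documents rather than the training corpus.
    bound = lda.log_perplexity(corpus)
    print(f"k={k}: per-word bound = {bound:.4f}")
    if bound > best_bound:
        best_k, best_bound = k, bound

print("best k by likelihood bound:", best_k)
```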

Upvotes: 3

Selva

Reputation: 2113

A reliable way is to compute the topic coherence for different numbers of topics and choose the model that gives the highest topic coherence. But the highest-coherence model may not always fit the bill.
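For instance, with gensim a rough sketch looks like this (assuming `tokenized_docs`, `dictionary`, and `corpus` already exist from your preprocessing; the range of k values is arbitrary):

```python
# Sketch: pick k by c_v topic coherence (higher is better).
# Assumes `tokenized_docs` (list of token lists), `dictionary` and `corpus` exist.
from gensim.models import LdaModel, CoherenceModel

scores = {}
for k in range(5, 31, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=0)
    cm = CoherenceModel(model=lda, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    scores[k] = cm.get_coherence()
    print(f"k={k}: coherence = {scores[k]:.4f}")

best_k = max(scores, key=scores.get)
print("k with highest coherence:", best_k)
```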


See this topic modeling example.

Upvotes: 14

Pooja

Reputation: 51

Some people use the harmonic mean for finding the optimal number of topics; I tried it as well, but the results were unsatisfactory. So my suggestion is: if you are using R, the package "ldatuning" will be useful. It provides four metrics for choosing the optimal number of topics. Perplexity and log-likelihood based V-fold cross-validation are also very good options for finding the best topic model, although V-fold cross-validation can be a bit time consuming for large datasets. You can see "A heuristic approach to determine an appropriate number of topics in topic modeling".

Important links:
https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
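ldatuning itself is R-only, but the V-fold cross-validation idea can be sketched in Python with scikit-learn roughly as follows (the 5 folds, the candidate k values, and the vectorizer settings are arbitrary choices of mine):

```python
# Sketch: choose k by V-fold cross-validated perplexity (lower is better).
# Assumes `docs` is a list of raw text documents.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

X = CountVectorizer(max_df=0.95, min_df=2, stop_words="english").fit_transform(docs)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for k in [5, 10, 20, 40]:
    perplexities = []
    for train_idx, test_idx in kf.split(X):
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X[train_idx])                      # fit on the training folds
        perplexities.append(lda.perplexity(X[test_idx]))  # score on the held-out fold
    print(f"k={k}: mean held-out perplexity = {np.mean(perplexities):.1f}")
```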

Upvotes: 5

Chthonic Project

Reputation: 8356

Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, the hierarchical Dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.

If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
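As a concrete starting point, gensim ships an online variational HDP implementation. A minimal sketch, assuming `corpus` and `dictionary` already exist from your preprocessing, and with the 1% weight threshold being my own heuristic rather than part of the algorithm:

```python
# Sketch: let HDP infer the number of topics instead of fixing k up front.
# Assumes `corpus` (bag-of-words) and `dictionary` were built with gensim beforehand.
import numpy as np
from gensim.models import HdpModel

hdp = HdpModel(corpus=corpus, id2word=dictionary)

# HDP keeps a (truncated) list of topics with associated weights; counting the
# topics that carry non-negligible weight gives a rough estimate of k.
# The 1% cutoff below is an arbitrary heuristic on my part.
alpha = hdp.hdp_to_lda()[0]      # per-topic weights of the almost-equivalent LDA
weights = alpha / alpha.sum()
k_estimate = int(np.sum(weights > 0.01))
print("estimated number of topics:", k_estimate)
```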

Upvotes: 16
