Reputation: 629
I am a freshman in LDA and I want to use it in my work. However, some problems appear.
In order to get the best performance, I want to estimate the best topic number. After reading "Finding Scientific topics", I know that I can calculate logP(w|z) firstly and then use the harmonic mean of a series of P(w|z) to estimate P(w|T).
My question is what does the "a series of" mean?
Upvotes: 31
Views: 43809
Reputation: 18440
Since I am working on that same problem, I just want to add the method proposed by Wang et al. (2019) in their paper "Optimization of Topic Recognition Model for News Texts Based on LDA". Besides giving a good overview, they suggest a new method. First you train a word2vec model (e.g. using the word2vec
package), then you apply a clustering algorithm capable of finding density peaks (e.g. from the densityClust
package), and then use the number of found clusters as number of topics in the LDA algorithm.
If time permits, I will try this out. I also wonder if the word2vec model can make the LDA obsolete.
Upvotes: 0
Reputation: 349
Let k = number of topics
There is no single best way and I am not even sure if there is any standard practices for this.
Method 1: Try out different values of k, select the one that has the largest likelihood.
Method 2: Instead of LDA, see if you can use HDP-LDA
Method 3: If the HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus and run HDP-LDA on that, take the value of k as given by HDP-LDA. For a small interval around this k, use Method 1.
Upvotes: 3
Reputation: 2113
A reliable way is to compute the topic coherence for different number of topics and choose the model that gives the highest topic coherence. But sometimes, the highest may not always fit the bill.
See this topic modeling example.
Upvotes: 14
Reputation: 51
First some people use harmonic mean for finding optimal no.of topics and i also tried but results are unsatisfactory.So as per my suggestion ,if you are using R ,then package"ldatuning" will be useful.It has four metrics for calculating optimal no.of parameters. Again perplexity and log-likelihood based V-fold cross validation are also very good option for best topic modeling.V-Fold cross validation are bit time consuming for large dataset.You can see "A heuristic approach to determine an appropriate no.of topics in topic modeling". Important links: https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597325/
Upvotes: 5
Reputation: 8356
Unfortunately, there is no hard science yielding the correct answer to your question. To the best of my knowledge, hierarchical dirichlet process (HDP) is quite possibly the best way to arrive at the optimal number of topics.
If you are looking for deeper analyses, this paper on HDP reports the advantages of HDP in determining the number of groups.
Upvotes: 16