Reputation: 1238
Using the example for lda from quanteda package
require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodels)
corp_news <- download('data_corpus_guardian')
corp_news_subset <- corpus_subset(corp_news, date >= "2016-01-01")
ndoc(corp_news_subset)
dfmat_news <- dfm(corp_news_subset, remove_punct = TRUE, remove = stopwords('en')) %>%
    dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>%
    dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")
dfmat_news <- dfmat_news[ntoken(dfmat_news) > 0,]
dtm <- convert(dfmat_news, to = "topicmodels")
lda <- LDA(dtm, k = 10)
Are there any metrics that can help determine the appropriate number of topics? I need this because my texts are small and I don't know whether the performance is adequate. Also, is there any way to obtain a performance measure (e.g. precision/recall) to compare the performance of LDA with different features?
Upvotes: 0
Views: 437
Reputation: 1595
There are several goodness-of-fit (GoF) metrics you can use to assess an LDA model. The most common is perplexity, which you can compute through the function perplexity()
in the package topicmodels. The way you select the optimal model is to look for a "knee" in the plot. The idea, stemming from unsupervised methods, is to run multiple LDA models with different numbers of topics. As the number of topics increases, you should see the perplexity decrease. You want to stop either when you find a knee or when the incremental decrease is negligible. Think of the scree plot you examine when you run a Principal Component Analysis.
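As a minimal sketch of that knee search (assuming `dtm` is the document-term matrix you created with convert(dfmat_news, to = "topicmodels"); the candidate values of k and the seed are arbitrary choices for illustration):

```r
library(topicmodels)

# Fit one LDA model per candidate number of topics and record its perplexity.
ks <- c(5, 10, 15, 20, 25, 30)
perplexities <- sapply(ks, function(k) {
  fit <- LDA(dtm, k = k, control = list(seed = 1234))
  perplexity(fit, dtm)
})

# Plot perplexity against k and look for the "knee": the point after
# which adding more topics yields only a negligible decrease.
plot(ks, perplexities, type = "b",
     xlab = "Number of topics (k)", ylab = "Perplexity")
```

Note that perplexity here is computed on the training data; if you have enough documents, a held-out set gives a less optimistic estimate.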
Having said that, there is an R package called ldatuning which implements four additional metrics based on density-based clustering and on Kullback-Leibler divergence. Three of them can be used with both VEM and Gibbs inference, while the method by Griffiths can only be used with Gibbs. For some of these metrics you look for the minimum, for others the maximum. Also, you can always compute the log-likelihood of your model, which you want to maximize. Extracting the likelihood from an LDA
object is straightforward. Let's assume you have an LDA model called ldamodel
:
loglikelihood = as.numeric(logLik(ldamodel))
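The ldatuning metrics mentioned above can all be computed in one call with FindTopicsNumber(). A sketch, again assuming `dtm` is the document-term matrix from the question (the range of topics and the seed are placeholder choices):

```r
library(ldatuning)

# Evaluate all four metrics over a grid of candidate topic numbers.
# Gibbs sampling is used so that the Griffiths2004 metric is available.
result <- FindTopicsNumber(
  dtm,
  topics  = seq(5, 50, by = 5),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1234)
)

# Plots all metrics on a common scale: minimize CaoJuan2009 and Arun2010,
# maximize Griffiths2004 and Deveaud2014.
FindTopicsNumber_plot(result)
```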
There is a lot of research around this topic. For instance, you can have a look at these papers:
In addition, you can have a look at the preprint of a paper I am working on with a colleague of mine which uses simple parametric tests to evaluate GoF. We also developed an R package which can be used over a list of LDA models of class LDA
from topicmodels. You can find the paper here and the package here. You are more than welcome to submit any issue you may find in the package. The paper is under review at the moment, but again, comments are more than welcome.
Hope this helps!
Upvotes: 1