smwikipedia

Reputation: 64213

How to determine the number of topics in the LDA (Latent Dirichlet Allocation) algorithm for text clustering?

I am using the LDA algorithm to cluster many documents into different topics. The LDA algorithm needs an input parameter: the number of topics. How could I determine this?

I am using the Reuters corpus to benchmark my solution, and the Reuters corpus already comes with a known number of topics. Should I input the same number of topics when I cluster the Reuters text, and then compare my clustering result to Reuters' topics?

But in production, how can I know the number of topics before I actually cluster the documents by topic? It's kind of a chicken-and-egg problem.

Upvotes: 4

Views: 6306

Answers (1)

Clock Slave

Reputation: 7967

One way you can approach this is through k-means. Run k-means on your document vectors for a range of cluster counts and use the silhouette score to pick the best number of clusters (an elbow curve also works, but I guess that will require manual inspection). You can then use this number as the number of topics for LDA.
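As a rough illustration of the silhouette idea, here is a minimal sketch in plain Python. It is not LDA-specific and makes several assumptions: the toy 2D `points` stand in for real document vectors (for example TF-IDF rows), the helper names `kmeans`, `mean_silhouette` and `best_k` are hypothetical, and the k-means uses a simple deterministic farthest-first initialisation. In practice you would probably use a library implementation instead, and treat the result only as a starting guess for the number of topics.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = None
    for _ in range(iters):
        # Assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, candidates):
    """Return the candidate k with the highest mean silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy data (an assumption for illustration): three tight, well-separated groups.
points = [
    (x + dx, y + dy)
    for x, y in [(0, 0), (10, 0), (0, 10)]
    for dx in (0, 1)
    for dy in (0, 1)
]

k, scores = best_k(points, range(2, 6))
print(k, scores)
```

On this toy data the three groups are recovered, so `k` comes out as 3. With real text the silhouette curve is usually much flatter, so it is worth looking at the scores for all candidates rather than only the maximum.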

Upvotes: 1
