Reputation: 4869
I'm looking for a technique similar to LDA, but where the optimal number of "mixtures" (topics) doesn't have to be specified in advance -- is there anything out there that can do that?
Upvotes: 3
Views: 2576
Reputation: 7394
As Byron said, the simplest way to do this is to compare likelihoods for different values of k. However, if you take care to consider the probability of some held-out data (i.e. not used to induce the model), this naturally penalises overfitting and so you don't need to normalise for k. A simple way to do this is to take your training data and split it into a training set and a dev set, and do a search over a range of plausible k values, inducing models from the training set and then computing dev set probability given the induced model.
It's worth mentioning that computing the likelihood exactly under LDA is intractable, so you're going to need to use approximate inference. This paper goes into it in depth, but if you use a standard LDA package (I'd recommend mallet: http://mallet.cs.umass.edu/) it should have this functionality already.
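For concreteness, here is a minimal sketch of that dev-set search using scikit-learn's LatentDirichletAllocation (which uses variational inference under the hood) rather than mallet; the documents and the range of k values are placeholders you'd replace with your own:

    # Sketch: pick k by approximate held-out log-likelihood on a dev set.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    docs = ["..."]  # your raw documents (placeholder)

    # Bag-of-words matrix, then hold out a dev set not used to induce the model.
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    X_train, X_dev = train_test_split(X, test_size=0.2, random_state=0)

    best_k, best_score = None, float("-inf")
    for k in range(5, 105, 10):  # plausible range of topic counts (assumption)
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X_train)
        score = lda.score(X_dev)  # approximate held-out log-likelihood
        if score > best_score:
            best_k, best_score = k, score

    print("best k by dev-set likelihood:", best_k)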
The non-parametric version is indeed the correct way to go, but inference in non-parametric models is computationally expensive, so I would hesitate to pursue this unless the above doesn't work.
Upvotes: 2
Reputation: 151
There are two ways of going about this: one hacky but easy, the other better motivated but more complex. Starting with the former, you could simply try a range of k (the number of topics) and compare the likelihoods of the observed data under each. You would probably want to penalize larger numbers of topics, depending on your situation -- or you could explicitly place a prior distribution over k (e.g., a normal distribution centered on the subjectively expected number of clusters). In either case you would simply select the k that maximizes the likelihood.
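To spell out the "prior over k" variant, here is a hypothetical sketch: it scores each candidate k by the model's approximate log-likelihood plus a normal log-prior on k and keeps the maximiser. The documents, the prior's mean and scale, and the k range are all assumptions for illustration:

    # Hypothetical sketch: likelihood plus an explicit normal prior over k.
    from scipy.stats import norm
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["..."]  # your raw documents (placeholder)
    X = CountVectorizer(stop_words="english").fit_transform(docs)

    def penalized_score(k, expected_k=20, prior_scale=10):
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        log_lik = lda.score(X)  # approximate log p(data | k)
        log_prior = norm.logpdf(k, loc=expected_k, scale=prior_scale)  # log p(k)
        return log_lik + log_prior  # unnormalised log posterior over k

    best_k = max(range(5, 55, 5), key=penalized_score)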
The more principled approach is to use Bayesian nonparametrics, in particular Dirichlet processes in the case of topic models. Have a look at this paper. I believe there is an implementation available here, though I haven't looked into it much.
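As a concrete starting point, a hierarchical Dirichlet process (HDP) topic model infers the number of topics from the data; here is a minimal sketch using gensim's HdpModel (which may or may not be the implementation linked above), with placeholder tokenised documents:

    # Sketch: nonparametric topic model (HDP); no num_topics argument is needed.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    texts = [["example", "tokens", "for", "one", "document"]]  # placeholder tokenised docs

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    hdp = HdpModel(corpus, id2word=dictionary)
    print(hdp.print_topics(num_topics=10, num_words=8))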
Upvotes: 6