mcAngular2

Reputation: 309

Bayesian Hyperparameter Optimization

I've done some experiments with Bayesian hyperparameter optimization for my LSTM hyperparameters.

I use an approach where the error is modeled with a Gaussian process, and also the TPE algorithm. Both are working pretty well.

I'm wondering why these strategies are called "Bayesian". Can anyone explain what "Bayesian" means in the context of hyperparameter optimization?

Thanks

Upvotes: 1

Views: 635

Answers (1)

user3658307

Reputation: 801

Well, firstly, Gaussian processes fall under the domain of non-parametric Bayesian learning, meaning they are generally considered Bayesian models. The Tree-structured Parzen Estimator, on the other hand, fundamentally relies on Bayes' rule: it models p(x|y) and p(y), from which p(y|x) can be obtained via Bayes' rule.
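To make the TPE side concrete, here is a rough sketch of the idea for a single continuous hyperparameter (not the actual Hyperopt implementation; the data, the quantile `gamma`, and the variable names are illustrative placeholders): split past trials into "good" and "bad" by a quantile of the observed error, model p(x|y) with one density per group, and prefer candidates where the ratio of the good density to the bad density is large.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_hist = rng.uniform(0, 1, 50)                            # hyperparameter values tried so far
y_hist = (x_hist - 0.3) ** 2 + rng.normal(0, 0.01, 50)    # their (made-up) validation errors

gamma = 0.25                                  # fraction of trials treated as "good"
threshold = np.quantile(y_hist, gamma)
l = gaussian_kde(x_hist[y_hist <= threshold])  # density for p(x | y "good")
g = gaussian_kde(x_hist[y_hist > threshold])   # density for p(x | y "bad")

# Sample candidates from the "good" density and pick the one maximizing
# l(x)/g(x), which is what TPE uses as its selection criterion.
candidates = l.resample(100).ravel()
x_next = candidates[np.argmax(l(candidates) / g(candidates))]
```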

But regardless, when people refer to Bayesian optimization, they are mostly talking about the search approach itself. Something is Bayesian if it involves (1) a probabilistic prior belief and (2) a principled way to update one's beliefs when new evidence is acquired. GPs, for instance, define a prior over functions together with a way to update the posterior (the new distribution after evidence is acquired), which is exactly what we want for Bayesian ML.

Usually what is done is to start with a Bayesian prior over the (hyper)parameter space (encoding your prior beliefs about what the performance should be). We define an acquisition function a(x), which helps us choose which parameters to look at next. Since we have a probabilistic Bayesian model, we have a notion of uncertainty: e.g., we might know the variance in the predictive distribution of our model at a particular point. At points far from our observations, the variance will be high, while at points near our observations, the variance will be low. We have a distribution p(y|x), in other words. This explicit accounting for uncertainty is a huge benefit of Bayesian approaches.
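As a minimal sketch of what such a probabilistic surrogate gives you, here is the predictive mean and uncertainty from scikit-learn's GaussianProcessRegressor; the objective f and the 1-D hyperparameter grid below are made-up placeholders, not anything from the question:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):
    # Hypothetical "validation error as a function of the hyperparameter".
    return np.sin(3 * x) + 0.1 * x ** 2

# A few hyperparameter values we have already evaluated.
X_obs = np.array([[-2.0], [0.0], [1.5]])
y_obs = f(X_obs).ravel()

# Fit the GP surrogate: it encodes a prior over functions and, after
# conditioning on (X_obs, y_obs), gives a posterior predictive p(y|x).
gp = GaussianProcessRegressor().fit(X_obs, y_obs)

# Candidate hyperparameter values we have not tried yet.
X_cand = np.linspace(-3, 3, 200).reshape(-1, 1)

# Predictive mean and standard deviation: the std is large far from the
# observations and small near them -- the "notion of uncertainty" above.
mu, sigma = gp.predict(X_cand, return_std=True)
```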

The acquisition function a(x) usually has to balance two factors: (1) uncertainty, since in uncertain areas there may be "hidden gems" that we haven't seen yet, and (2) proven performance (i.e., we should stay near the areas of space we have observed to be good). One might therefore design a(x) to minimize the entropy (uncertainty) in the distribution, or to maximize Bayesian surprise, meaning "choose points that cause the maximum change in the posterior distribution upon observation". Similar methods are used for exploration in reinforcement learning (search for "Bayesian surprise" or "curiosity"); any such method involving updating "posterior beliefs" is generally considered Bayesian.
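One common concrete choice for a(x) that trades off exactly these two factors is Expected Improvement; this is just one standard acquisition function, not the only possibility. Continuing the GP sketch above (reusing mu, sigma, X_cand, y_obs from it):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    # mu, sigma: GP predictive mean/std at the candidates (we minimize error).
    # y_best: best (lowest) error observed so far; xi: small exploration bonus.
    sigma = np.maximum(sigma, 1e-9)       # avoid division by zero
    imp = y_best - mu - xi                # how much better than y_best we expect to do
    z = imp / sigma
    return imp * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the next hyperparameter value to evaluate: high where the predicted
# error is low (exploitation) or the uncertainty is high (exploration).
ei = expected_improvement(mu, sigma, y_obs.min())
x_next = X_cand[np.argmax(ei)]
```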

TLDR: they are Bayesian because they involve starting with a prior and iteratively updating posterior beliefs as new evidence is observed.

Upvotes: 0
