Simulated Annealing - Intuition

Cross posted from csexchange:

Most versions of simulated annealing I've seen are implemented similar to what is outlined in the wikipedia pseudocode below:

Let s = s0
For k = 0 through kmax (exclusive):
  T ← temperature( 1 - (k+1)/kmax )
  Pick a random neighbour, snew ← neighbour(s)
  If P(E(s), E(snew), T) ≥ random(0, 1):
    s ← snew
Output: the final state s

I am having trouble understanding how this algorithm does not get stuck in a local optima as the temperature cools. If we jump around at the start while the temp is high, and eventually only take uphill moves as the temp cools, then isn't the solution found highly dependent on where we just so happened to end up in the search space as the temperature started to cool? We may have found a better solution early on, jumped off of it while the temp was high, and then be in a worse-off position as the temp cools and we transition to hill climbing.

An often listed modification to this approach is to keep track of the best solution found so far. I see how this change mitigates the risk of "throwing away" a better solution found in the exploratory stage when the temp is high, but I don't see how this is any better than simply performing repeated random hill-climbing to sample the space, without the temperature theatrics.

Another approach that comes to mind is to combine the ideas of keeping track of the "best so far" with repeated hill climbing and beam search. For each temperature, we could perform simulated annealing and track the best 'n' solutions. Then for the next temperature, start from each of those local peaks.

UPDATE: It was rightly pointed out I didn't really ask a specific question, I've updated in the comments but want to reflect it here:

I don't need a transcription of the pseudo-code into essay form, I understand how it works - how the temperature balances exploration vs exploitation, and how this (theoretically) helps avoid local optima.

My specific questions are:

Without modification to keep track of the global best solution, wouldn't this algorithm be extremely prone to local optima, despite the temperature component? Yes, I understand the probability of taking a worse move declines as the temperature cools (e.g. it transitions to pure hill-climbing), but it would be possible that you had found a better solution early on in the exploitation portion (temp hot), jumped off of it, and then as the temperature cools you have no way back because you're on a path that leads to a new local peak.
With the addition of tracking the global optima, I can absolutely see how this mitigates getting stuck at local peaks avoiding the problem described above. But how does this improve upon simple random search? One specific example being repeated random hill climbing. If you're tracking the global optima and you happen to hit it while in the high-temp portion, then that essentially is a random-search.
In what circumstances would this algorithm be preferable to something like repeated random hill-climbing and why? What properties does a problem have that makes it particularly suited to SA vs an alternative.

Upvotes: 3

Answers (3)

ldog

Reputation: 12151

The question as asked doesn't seem very coherent, it seems to ask about guarantees offered by the vanilla simulated annealing algorithm, then proceeds to propose multiple modifications with little to no justification.

In any case, I will try to be coherent in this answer.

The only known rigorous proof of simulated annealing that I know of is described as

The concept of using an adaptive step-size has been essential to the development of better annealing algorithms. The Classical Simulated Annealing (CSA) was the first annealing algorithm with a rigorous mathematical proof for its global convergence (Geman and Geman, 1984). It was proven to converge if a Gaussian distribution is used for GXY (TK), coupled with an annealing schedule S(Tk) that decreases no faster than T = To/(log k). The integer k is a counter for the external annealing loop, as will be detailed in the next section. However, a logarithmic decreasing schedule is considered to be too slow, and for many problems the number of iterations required is considered as “overkill” (Ingber, 1993).

Where the proof of this can be found in

Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on pattern analysis and machine intelligence 6 (1984): 721-741.

In terms of successful modifications of SA, if you peruse literature you will find many empirical claims of superiority, but few proofs. Having studied this problem in the past, I can say from the various modified versions of SA I have tried empirically for the problems I was interested in at the time, the only SA modification that consistently gave better results than classical simulated annealing is known as parallel tempering.

Parallel tempering is the modification of SA whereby instead of mixing one Markov-Chain (i.e. the annealing step) multiple chains are mixed concurrently at very specific temperatures, with exchanges between chains occurring on a dynamic or static schedule.

Original paper on parallel tempering:

Swendsen, Robert H., and Jian-Sheng Wang. "Replica Monte Carlo simulation of spin-glasses." Physical review letters 57.21 (1986): 2607.

There are various free and proprietary implementations of parallel tempering available due to its effectiveness; I would highly recommend starting with a a high quality implementation instead of rolling your own.

Note also, parallel tempering is extensively used throughout various physical sciences for modelling purposes due to its efficacy, but for some reason has a very small representation in other areas of literature (such as computer science,) which actually baffles me somewhat. Perhaps computer scientists general disinterest with SA modifications may stem from their belief that SA is unlikely to be improved upon.

Upvotes: 1

user1196549

Reputation:

The challenge in global function evaluation is to obtain good efficiency, i.e. find the optimal solution or one that is close to the optimum with as little function evaluations as possible. So many "we could perform..." reasonings are correct in terms of convergence to a good solution, but may fail to be efficient.

Now the main difference between pure climbing and annealing is that annealing temporarily admits a worsening of the objective to avoid... getting trapped in a local optimum. Hill climbing only searches the subspace where the function takes larger values than the best so far, which may fail to be connected to the global optimum. Annealing adds some "tunneling" effect.

Temperature decrease is not the key ingredient here and by the way, hill climbing methods such as the Simplex or Hooke-Jeeves patterns also work with diminishing perturbations. Temperature schemes are more related to multiscale decomposition of the target function.

Upvotes: 0

AirSquid

Reputation: 11903

By my understanding, simulated annealing is not guaranteed to not get stuck in a local maxima (for maximization problems), especially as it "cools" where later in the cycle as k -> kmax.

So as you say, we "jump all over the place" at the start, but we are still selective in whether or not we accept that jump, as the P() function that determines the probability of acceptance is a function of the goal, E(). Later in the same wikipedia article, they describe the P() function a bit, and suggest that if e_new > e, then perhaps P=1, always taking the move if it is an improvement, but perhaps not always if it isn't an improvement. Later in the cycle, we aren't as willing to take random leaps to lesser results, so the algorithm tends to settle into a max (or min), which may or may not be the global.

Upvotes: 5

Simulated Annealing - Intuition

Answers (3)

Related Questions