Reputation: 11489
In our development environment, our functions app typically only requires a couple of servers.
Over the weekend we unexpectedly triggered a chain reaction which led to gradual, sustained scaling out up to about 140 servers. Now that we've fixed the underlying issue and cleared the queues, the activity seems to have returned to normal. (Phew!)
The strange thing is that (30 mins later) we still have all those servers online. I would have expected them to start being taken offline quite quickly, but instead I'm seeing the number of "servers online" fluctuate (up as well as down) between about 110 and 140. The vast majority of these are sitting there with 0 requests/s, no CPU and no memory.
So, some questions:
Why are all those idle servers still online, and when should I expect them to be scaled back in?
Is there any cost to us while they sit there idle?
Upvotes: 2
Views: 667
Reputation: 5008
Just to add to what @Mikhail wrote: how your functions scale depends on the type of trigger used. If you use a queue, the runtime will take a look at the queue length and scale out/in depending on the number of messages. With Event Hubs this behaviour depends on the number of partitions in the hub: the more you have, the more eager to scale your functions will be.
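For a concrete (hypothetical) illustration, here is a minimal queue-triggered function; the queue name "orders" and the message shape are made up for the example. The runtime monitors the length of this queue and adds or removes instances to match the backlog:

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class OrderProcessor
{
    // The scale controller watches the "orders" queue; a growing backlog
    // of messages is what drives new instances to be added.
    [FunctionName("OrderProcessor")]
    public static void Run(
        [QueueTrigger("orders")] string message,
        ILogger log)
    {
        log.LogInformation($"Processing order: {message}");
    }
}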
You can take a look at the source code and try to understand at least part of the functionality. It is based on a timer and a concept of workers, which update their statuses and let the runtime decide whether it needs to scale up or down.
The overall algorithm is described as follows:
protected virtual async Task MakeScaleDecision(string activityId, IWorkerInfo manager)
{
    // Throttle: only one scale decision per ScaleCheckInterval.
    if (DateTime.UtcNow < _scaleCheckUtc)
    {
        return;
    }

    try
    {
        var workers = await _table.ListNonStale();
        _tracer.TraceInformation(activityId, manager, workers.GetSummary("NonStale"));

        // Each Try* method returns true if it took a scaling action,
        // which short-circuits the remaining checks for this pass.
        if (await TryRemoveIfMaxWorkers(activityId, workers, manager))
        {
            return;
        }

        if (await TryAddIfLoadFactorMaxWorker(activityId, workers, manager))
        {
            return;
        }

        if (await TrySwapIfLoadFactorMinWorker(activityId, workers, manager))
        {
            return;
        }

        if (await TryAddIfMaxBusyWorkerRatio(activityId, workers, manager))
        {
            return;
        }

        if (await TryRemoveIfMaxFreeWorkerRatio(activityId, workers, manager))
        {
            return;
        }

        if (await TryRemoveSlaveWorker(activityId, workers, manager))
        {
            return;
        }
    }
    catch (Exception ex)
    {
        _tracer.TraceError(activityId, manager, string.Format("MakeScaleDecision failed with {0}", ex));
    }
    finally
    {
        // Schedule the next decision pass regardless of the outcome.
        _scaleCheckUtc = DateTime.UtcNow.Add(_settings.ScaleCheckInterval);
    }
}
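Note the shape of this method: it throttles itself to one decision per ScaleCheckInterval (the finally block resets the timer whether or not anything happened), and the checks run in a fixed priority order, with the max-worker cap enforced first, then scale-out checks, and only then the scale-in checks. Since the first check that takes an action ends the pass, removing free workers only happens on passes where nothing else fired, which may be part of why scale-in feels slow after a burst.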
What's more, the answer to why you still see workers alive can also be found in the source code:
protected virtual async Task PingWorker(string activityId, IWorkerInfo worker)
{
    // if ping was unsuccessful, keep pinging. this is to address
    // the issue where site continue to run on an unassigned worker.
    if (!_pingResult || _pingWorkerUtc < DateTime.UtcNow)
    {
        // if PingWorker throws, we will not update the worker status
        // this worker will be stale and eventually removed.
        _pingResult = await _eventHandler.PingWorker(activityId, worker);
        _pingWorkerUtc = DateTime.UtcNow.Add(_settings.WorkerPingInterval);
    }

    // check if worker is valid for the site
    if (_pingResult)
    {
        await _table.AddOrUpdate(worker);
        _tracer.TraceUpdateWorker(activityId, worker, string.Format("Worker loadfactor {0} updated", worker.LoadFactor));
    }
    else
    {
        _tracer.TraceWarning(activityId, worker, string.Format("Worker does not belong to the site."));
        await _table.Delete(worker);
        _tracer.TraceRemoveWorker(activityId, worker, "Worker removed");
        throw new InvalidOperationException("The worker does not belong to the site.");
    }
}
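The takeaway from PingWorker: as long as a worker responds to pings, it keeps refreshing its entry in the worker table and stays counted as alive. It is only removed once a ping fails (the worker no longer belongs to the site) or PingWorker throws and the entry goes stale. That is consistent with what you observed: idle workers can linger well after the load has gone, until the scale-in checks above get around to removing them.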
Unfortunately, some parts of the implementation are sealed (like IWorkerInfo), so we can't get the whole picture; we can only guess (or ask ;)).
Upvotes: 2
Reputation: 35154
The Scale Controller is a closed-source, proprietary piece of technology with undocumented behavior. There is no official answer to your question. The only way to answer it is to do what you did: run an experiment and measure. Such behavior may change over time, though.
Functions do have some heuristics for when to scale out and when to scale in, based on the amount of events and on resource utilization. In the past, they were quite frugal, but that did not provide a good enough scale-out pace. Now, they tend to err on the side of provisioning too many servers rather than too few.
The good news is that you shouldn't really care about scaling in, except out of curiosity: on the Consumption plan, you don't pay for idle servers.
Upvotes: 2