Reputation: 11489
In our development environment, our functions app typically only requires a couple of servers.
Over the weekend we unexpectedly triggered a chain reaction which led to gradual, sustained scaling out up to about 140 servers. Now that we've fixed the underlying issue and cleared the queues, the activity seems to have returned to normal. (Phew!)
The strange thing is that (30 mins later) we still have all those servers online. I would have expected them to start being taken offline quite quickly, but instead I'm seeing the number of "servers online" fluctuate (up as well as down) between about 110 and 140. The vast majority of these are sitting there with 0 requests/s, no CPU and no memory.
So, some questions:
Why are all those idle servers still online, and when should I expect them to be scaled back in?
Is there any cost to us while they sit there idle?
Upvotes: 2
Views: 667
Reputation: 5008
Just to add to what @Mikhail wrote: how your functions scale depends on the type of trigger used. If you use a queue, the runtime will take a look at the queue length and scale out/in depending on the number of messages. With Event Hubs this behaviour depends on the number of partitions in the hub: the more you have, the more eager to scale your functions will be.
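For a concrete (hypothetical) illustration, here is a minimal queue-triggered function; the queue name "orders" and the message shape are made up for the example. The runtime monitors the length of this queue and adds or removes instances to match the backlog:

using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class OrderProcessor
{
    // The scale controller watches the "orders" queue; a growing backlog
    // of messages is what drives new instances to be added.
    [FunctionName("OrderProcessor")]
    public static void Run(
        [QueueTrigger("orders")] string message,
        ILogger log)
    {
        log.LogInformation($"Processing order: {message}");
    }
}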
You can take a look at the source code and try to understand at least part of the functionality. It is based on a timer and a concept of workers, which update their statuses and let the runtime decide whether it needs to scale up or down.
The overall algorithm is described as follows:
protected virtual async Task MakeScaleDecision(string activityId, IWorkerInfo manager)
{
    // Throttle: only one scale decision per ScaleCheckInterval.
    if (DateTime.UtcNow < _scaleCheckUtc)
    {
        return;
    }

    try
    {
        var workers = await _table.ListNonStale();
        _tracer.TraceInformation(activityId, manager, workers.GetSummary("NonStale"));

        // Each Try* method returns true if it took a scaling action,
        // which short-circuits the remaining checks for this pass.
        if (await TryRemoveIfMaxWorkers(activityId, workers, manager))
        {
            return;
        }

        if (await TryAddIfLoadFactorMaxWorker(activityId, workers, manager))
        {
            return;
        }

        if (await TrySwapIfLoadFactorMinWorker(activityId, workers, manager))
        {
            return;
        }

        if (await TryAddIfMaxBusyWorkerRatio(activityId, workers, manager))
        {
            return;
        }

        if (await TryRemoveIfMaxFreeWorkerRatio(activityId, workers, manager))
        {
            return;
        }

        if (await TryRemoveSlaveWorker(activityId, workers, manager))
        {
            return;
        }
    }
    catch (Exception ex)
    {
        _tracer.TraceError(activityId, manager, string.Format("MakeScaleDecision failed with {0}", ex));
    }
    finally
    {
        // Schedule the next decision pass regardless of the outcome.
        _scaleCheckUtc = DateTime.UtcNow.Add(_settings.ScaleCheckInterval);
    }
}
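Note the shape of this method: it throttles itself to one decision per ScaleCheckInterval (the finally block resets the timer whether or not anything happened), and the checks run in a fixed priority order, with the max-worker cap enforced first, then scale-out checks, and only then the scale-in checks. Since the first check that takes an action ends the pass, removing free workers only happens on passes where nothing else fired, which may be part of why scale-in feels slow after a burst.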
What's more, the answer to why you still see workers alive can also be found in the source code:
protected virtual async Task PingWorker(string activityId, IWorkerInfo worker)
{
    // if ping was unsuccessful, keep pinging. this is to address
    // the issue where site continue to run on an unassigned worker.
    if (!_pingResult || _pingWorkerUtc < DateTime.UtcNow)
    {
        // if PingWorker throws, we will not update the worker status
        // this worker will be stale and eventually removed.
        _pingResult = await _eventHandler.PingWorker(activityId, worker);
        _pingWorkerUtc = DateTime.UtcNow.Add(_settings.WorkerPingInterval);
    }

    // check if worker is valid for the site
    if (_pingResult)
    {
        await _table.AddOrUpdate(worker);
        _tracer.TraceUpdateWorker(activityId, worker, string.Format("Worker loadfactor {0} updated", worker.LoadFactor));
    }
    else
    {
        _tracer.TraceWarning(activityId, worker, string.Format("Worker does not belong to the site."));
        await _table.Delete(worker);
        _tracer.TraceRemoveWorker(activityId, worker, "Worker removed");
        throw new InvalidOperationException("The worker does not belong to the site.");
    }
}
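The takeaway from PingWorker: as long as a worker responds to pings, it keeps refreshing its entry in the worker table and stays counted as alive. It is only removed once a ping fails (the worker no longer belongs to the site) or PingWorker throws and the entry goes stale. That is consistent with what you observed: idle workers can linger well after the load has gone, until the scale-in checks above get around to removing them.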
Unfortunately, some parts of the implementation are sealed (like IWorkerInfo), so we can't get the whole picture; we can only guess (or ask ;)).
Upvotes: 2
Reputation: 35154
The Scale Controller is a closed-source, proprietary piece of technology with undocumented behavior. There is no official answer to your question. The only way to answer it is to do what you did: run an experiment and measure. Such behavior may change over time, though.
Functions do have some heuristics for when to scale out and when to scale in, based on the amount of events and on resource utilization. In the past, they were quite frugal, but that did not provide a good enough scale-out pace. Now, they tend to err on the side of provisioning too many servers rather than too few.
The good news is that you shouldn't really care about scaling in, except out of curiosity: on the Consumption plan, you don't pay for idle servers.
Upvotes: 2