Reputation: 3336
I'm trying to understand better how Linux's CFS (Completely Fair Scheduler) works behind the scenes to make some improvements on the Kubernetes side.
Well, let's imagine I have a processor with only 1 core. That means it can execute only 1 task at a time, no matter what; that's how a processor works. Since Linux kernel 2.6.23, the default scheduler is CFS (Completely Fair Scheduler), which tries to be fair by giving all processes the same share of CPU time.
Then, for the same 1-core processor, I have 2 processes. In that case, CFS will try to give each process 50% of the core (1/2 = 0.5). I know it's more complex than that: priorities and weights define each task's virtual runtime, so CFS can pick the correct one, the task with the least virtual runtime, from its time-ordered run queue (a red-black tree in the kernel).
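To check that I have the selection part right, here is a tiny sketch of my understanding in Python. It uses a `heapq` as a stand-in for the kernel's red-black tree (both give "pick the minimum"), and the `Task` fields and weight handling are simplified assumptions, not the kernel's actual bookkeeping:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    vruntime: float                      # weighted runtime so far, in ns
    name: str = field(compare=False)
    weight: float = field(compare=False, default=1.0)  # stand-in for nice-level weight

# Runnable tasks ordered by vruntime (the kernel uses a red-black tree;
# a heap gives the same "pop the minimum" behaviour for this sketch).
runqueue = []
heapq.heappush(runqueue, Task(0.0, "P1"))
heapq.heappush(runqueue, Task(0.0, "P2", weight=2.0))  # higher weight = higher priority

def run_one_slice(slice_ns: float) -> None:
    task = heapq.heappop(runqueue)           # least vruntime runs next
    task.vruntime += slice_ns / task.weight  # heavier tasks accrue vruntime slower
    print(f"ran {task.name}, vruntime now {task.vruntime:.0f}")
    heapq.heappush(runqueue, task)           # back into the run queue

for _ in range(4):
    run_one_slice(10_000_000)  # 10ms slices
```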
Now I know how CFS chooses the correct process to run (the one with the least virtual runtime) and dispatches it to the processor core.
The next part isn't clear to me, so I need your help to clarify it. Here is where things got confusing.
Let's say I have the same 2 processes (P1 and P2) and a 1-core processor. P1 needs 50ms to finish its job, and P2 needs 100ms. Ignoring CFS and sending P1 straight to the processor core would block P2 for 50ms, which means P2 takes 150ms total: 50ms blocked by P1 + 100ms of its own CPU burst time. Like this diagram:
When CFS sets sched_latency_ns=10000000 (10ms), I understand that to mean no process can run for more than 10ms before the scheduler switches to another one. So, look at my diagram: in that case P1 takes roughly 100ms to finish, because it spends about half its time preempted by P2, but on the other hand P2 starts making progress almost immediately instead of waiting 50ms for P1 to release the CPU. It's fairer for sure.
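A quick way to check these numbers is to simulate both policies. This is a minimal sketch under simplifying assumptions (fixed 10ms slices, strict alternation, zero context-switch cost), not how the kernel actually computes slices:

```python
# Compare run-to-completion (FCFS) vs. 10ms round-robin time slicing for
# two CPU-bound jobs on one core. Simplified: fixed slices, no switch cost.

def fcfs(jobs):
    """Run each job to completion in order; return finish times (ms)."""
    finish, now = {}, 0
    for name, need in jobs:
        now += need
        finish[name] = now
    return finish

def round_robin(jobs, slice_ms=10):
    """Alternate fixed slices between runnable jobs; return finish times (ms)."""
    remaining = dict(jobs)
    finish, now = {}, 0
    while remaining:
        for name in list(remaining):
            ran = min(slice_ms, remaining[name])
            now += ran
            remaining[name] -= ran
            if remaining[name] == 0:
                finish[name] = now
                del remaining[name]
    return finish

jobs = [("P1", 50), ("P2", 100)]
print("FCFS:       ", fcfs(jobs))         # {'P1': 50, 'P2': 150}
print("round robin:", round_robin(jobs))  # {'P1': 90, 'P2': 150}
```

P2 finishes at 150ms either way, but it starts making progress after 10ms instead of 50ms, and P1's completion moves from 50ms to about 90-100ms, which matches the trade-off described above.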
Now, when Kubernetes comes into play, I can use a different unit, millicores, and things get confusing again because CPU is measured in time. Here is what I understood: 100m = 100/1000 = 0.1 cores, so if my Linux kernel is set to sched_latency_ns=10000000 (10ms), then 100m should give me 1ms of CPU usage at a time (0.1 * 10ms = 1ms). So, a cgroup limited to 100m means my task runs for only 1ms at a time, no matter whether sched_latency_ns is greater than that.
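To make the arithmetic of my assumption concrete (this is just my mental model of how the limit would scale the slice, not necessarily what the kernel actually does):

```python
SCHED_LATENCY_NS = 10_000_000  # 10ms, the value from my example

def slice_under_my_model(millicores: int) -> float:
    """My assumption: the millicore limit simply scales the scheduling latency."""
    fraction = millicores / 1000                      # 100m -> 0.1 cores
    return fraction * SCHED_LATENCY_NS / 1_000_000    # result in ms

print(slice_under_my_model(100))  # 1.0 (ms) -> the 1ms figure above
```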
Sorry for the long text, but it's not an easy thing to explain, so I tried to be as clear as possible. Thanks anyway.
Upvotes: 1
Views: 1278
Reputation: 9022
Imagine you have 1 CPU core available and 1 pod that is assigned 1 CPU core via spec.resources.limits.cpu: 1. This means the pod is allowed to run for 1 CPU-second every second of real time. All processes running inside the pod share the same cgroup, so together they have 1 CPU-second at their disposal.
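For the mechanics behind that: the limit is enforced by CFS bandwidth control (cpu.cfs_quota_us / cpu.cfs_period_us), not by sched_latency_ns. Roughly, the kubelet derives a quota per period; a minimal sketch of that arithmetic, assuming the default 100ms period:

```python
CFS_PERIOD_US = 100_000  # default cpu.cfs_period_us: 100ms

def cfs_quota_us(millicores: int) -> int:
    """CPU time (us) the whole cgroup may consume per period."""
    return millicores * CFS_PERIOD_US // 1000

print(cfs_quota_us(1000))  # 100000 -> 100ms per 100ms period = 1 full core
print(cfs_quota_us(100))   # 10000  -> 10ms per 100ms period (the "100m" case)
```

So a 100m limit gives the whole cgroup 10ms of CPU time per 100ms period; once that's used up, every task in the cgroup is throttled until the next period starts.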
If you have one process inside the pod then this process will obviously run all the time.
If you have two processes running in the pod then each of them will run for half the time on average. This means the application will be throttled 50% of the time.
If you have 10 processes running then each of them will run for about 100ms per second. The application will be throttled 90% of the time.
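The arithmetic behind those three cases, as a quick sketch (assuming CPU-bound processes sharing the quota evenly, and using "throttled" loosely to mean "not on the CPU", as above):

```python
def share_per_process(cpu_limit_cores: float, nprocs: int) -> tuple[float, float]:
    """Per-process CPU-seconds per real second, and the fraction of time off-CPU."""
    share = cpu_limit_cores / nprocs
    throttled = 1.0 - share  # loose sense: fraction of time each process is not running
    return share, throttled

for n in (1, 2, 10):
    share, throttled = share_per_process(1.0, n)
    print(f"{n:2d} processes: {share * 1000:.0f}ms/s each, throttled {throttled:.0%}")
```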
It's possible to monitor the amount of time each container spends being throttled using the cAdvisor metrics container_cpu_cfs_throttled_seconds_total and container_cpu_cfs_throttled_periods_total.
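For example, a rough throttling ratio can be computed from two scrapes of those counters. A minimal sketch, where the sample values are made up and container_cpu_cfs_periods_total is the companion cAdvisor counter of total elapsed enforcement periods:

```python
# Two scrapes of the cAdvisor counters, some interval apart (values are made up).
t0 = {"container_cpu_cfs_throttled_periods_total": 1200,
      "container_cpu_cfs_periods_total": 3000}
t1 = {"container_cpu_cfs_throttled_periods_total": 1500,
      "container_cpu_cfs_periods_total": 3600}

def throttle_ratio(before: dict, after: dict) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    throttled = (after["container_cpu_cfs_throttled_periods_total"]
                 - before["container_cpu_cfs_throttled_periods_total"])
    periods = (after["container_cpu_cfs_periods_total"]
               - before["container_cpu_cfs_periods_total"])
    return throttled / periods

print(f"throttled in {throttle_ratio(t0, t1):.0%} of periods")  # 50%
```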
The conclusion is that under high load you don't want to enable CPU limits.
Upvotes: 1