We have recently had some problems with CPU throttling alarms creating noise, and we had various conversations about why pods that are clearly not using anything approaching their CPU limit could still generate alarms. A lot is written about this online, but I needed something more substantial to build consensus in our team. So I made a little tool and tested things. Let me share.
I put together a little script that uses timers to generate load for accurate fractions of a second. The load is always either 100% of a thread or zero (while the script sleeps). The load is generated inside (i.e. contained within) target time periods, each 1/10th of a second long. The load can be generated in one period, or two, all the way up to all ten periods. So there are two knobs to turn: how much of a 1/10th-second period the load occupies, and how many of those periods per second carry load.
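The original script isn't reproduced here, but a minimal sketch of the idea might look like this (the function names and structure are my own, not the actual tool's):

```python
import time

PERIOD = 0.1  # target period length in seconds (1/10th of a second)

def burn(seconds):
    """Spin at 100% of one thread for the given duration."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass  # busy-loop: the load is always all-or-nothing

def spiky_load(burn_fraction, spikes_per_second, duration):
    """Each second, burn `burn_fraction` of a 0.1 s period in the first
    `spikes_per_second` periods, then sleep out the rest; repeat for
    `duration` seconds."""
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        second_start = time.perf_counter()
        for i in range(10):  # ten 0.1 s periods per second
            period_start = second_start + i * PERIOD
            if i < spikes_per_second:
                burn(burn_fraction * PERIOD)
            # sleep out the remainder of this period
            remaining = period_start + PERIOD - time.perf_counter()
            if remaining > 0:
                time.sleep(remaining)
```

For example, `spiky_load(0.6, 1, 300)` would issue one 0.06-second spike per second for five minutes, one of the test steps described below.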
Note that this load is unusual because it is applied for a fixed time period, rather than for however long it takes to finish a real task. Therefore, if a pod experiences CPU throttling, the load still ends on time, and (of critical importance) fewer CPU cycles will have been consumed. This can be observed.
The load is also unusual because it lacks variety, which, as you will see later, matters for how the CPUThrottlingHigh alarms are triggered.
With this tool in hand, we can go see what triggers throttling and alarms.
So this test begins with one load spike per second, and extends the duration of that spike in steps, up to 0.08 seconds. Each step is tested for 5 minutes, in this case. The metric you see simply fails to represent the CPU spikes; they are too brief to see. (It's not a matter of unlucky polling timing, either: the metric is derived from a cumulative counter that is polled at intervals, so the spikes themselves will never be recorded, only smeared into the average.) The metric correctly records the average, but fails to show the true nature of the load.
The pod has a CPU request and limit set at 250m (a.k.a. 25% of one thread, a.k.a. 0.25 seconds of 100% load per second). I have marked the test steps which show clear deviation between observed and expected CPU load. (To be clear, the expected values are not shown; imagine the CPU load continuing to step up instead of leveling off.) This difference between observation and expectation is due to throttling. You can see that a load composed of 0.06-second spikes, once per second, is cut off at about the level that the similar load with 0.04-second spikes reached.
Clearly the average CPU usage here is way below 250m, so throttling acts on short time scales. The 0.04-second-spike load mostly escaped throttling at one spike per second, but over on the right side you can see that with 10 spikes of 0.04 seconds each per second, throttling was effective (only a little more CPU usage than in the previous test step).
Let's look at a more detailed graph.
The Kubernetes platform of course has approximately infinite depth to delve into. The reason any of this bubbled up to become a visible problem is that alarms were being triggered, creating noise and distraction. So, a word on how the alarms are configured is in order. We are using a PrometheusRule containing, I believe, essentially the defaults from the Prometheus Monitoring Mixin for Kubernetes project. Inside this PrometheusRule there is a CPUThrottlingHigh rule which compares two metrics, container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total, which reveal what the underlying system is up to. The alarm triggers if, in three consecutive 5-minute windows, more than 25% of the intervals (normally 0.1 seconds long) in which there was CPU load were throttled.
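In spirit (this is my sketch of the logic, not the actual PromQL), the rule does something like the following with the increase of those two counters over each window:

```python
def throttling_ratio(throttled_increase, total_increase):
    """Fraction of CFS periods that were throttled, given the increase
    of the two counters over some window (e.g. 5 minutes)."""
    if total_increase == 0:
        return 0.0
    return throttled_increase / total_increase

THRESHOLD = 0.25  # the mixin's default 25%

# The alarm fires only when the ratio stays above the threshold for
# three consecutive 5-minute evaluations. Hypothetical sample ratios:
samples = [0.30, 0.28, 0.27]
alarm = all(throttling_ratio(r * 100, 100) > THRESHOLD for r in samples)
print(alarm)  # True
```

Note the denominator only counts periods in which the container actually ran, which is why a mostly idle pod can reach a high ratio with very few throttled periods, as we will see.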
To cover another little bit of depth, that 1/10th of a second that keeps showing up is not something I invented. I'm no expert, but it seems that the underlying Linux kernel infrastructure for allocating CPU time, CFS Bandwidth Control, defaults to 100ms in a field cpu.cfs_period_us, and that value surfaces in Kubernetes. This can naturally be changed on your particular system, but that is beyond the scope of ... eh, it's beyond my scope.
So anyway, let's look a bit at one of those Prometheus(-compatible) metrics the alarm is based on.
This metric shows how many periods out of 10 (per second) had CPU cycles throttled. By the time the load was 0.06 seconds out of the 0.1-second period, the spike was always getting throttled. Interestingly, the 0.04-second spikes were only sometimes throttled until we reached 10 spikes per second. (This could be due to a spike semi-randomly being split between periods.)
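That parenthetical guess can be illustrated with a toy model (my own simplification, which ignores any other CPU usage in the period): a 0.04-second spike only trips the 25ms per-period quota if enough of it lands inside a single 100ms period.

```python
PERIOD = 0.100  # CFS period, seconds
QUOTA = 0.025   # 250m limit * 0.1 s period

def throttled_periods(spike_start, spike_len):
    """Toy model: split one spike across 0.1 s period boundaries and
    count how many periods individually exceed the quota."""
    throttled = 0
    t = spike_start
    end = spike_start + spike_len
    while t < end:
        period_end = (int(t / PERIOD) + 1) * PERIOD
        chunk = min(end, period_end) - t  # CPU time landing in this period
        if chunk > QUOTA:
            throttled += 1
        t = min(end, period_end)
    return throttled

print(throttled_periods(0.000, 0.040))  # fully inside one period -> 1
print(throttled_periods(0.080, 0.040))  # split 0.02 + 0.02 -> 0
```

Depending on where in the period the spike happens to start, the same 0.04-second spike is either throttled or invisible, which would explain the intermittent detection.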
The difficulty in measuring small violations of the CPU limit is real. I did extensive tests with spikes lasting 0.03 seconds (just slightly over a constant 250m) and these were almost never apparent in container_cpu_cfs_throttled_periods_total. Going to 0.04-second spikes improved detection a lot, but still only maybe 1 in 6 made it into the metric. I ran these tests issuing one spike per second, varying how many consecutive seconds contained a spike, and it made no particular difference whether there was one spike in 5 minutes or 300 of them: the detection rate remained constant ... and thus so did the alarm behavior. Hmmm.
Continuing down this path, I found that a single spike of 0.05 seconds, issued once per five minutes, over a period of maybe 30 minutes, is sufficient to trigger a CPUThrottlingHigh alarm. This depends on there being little or no other CPU usage by the pod, so it's a bit theoretical, but anyway I was able to trigger an alarm by exceeding the 250m CPU limit with a 5-minute average of 167 microcores of CPU load. Micro, not milli. To continue overdoing the emphasis: that is 1500x less than a constant 250m of CPU usage.
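To double-check that "micro, not milli" claim with plain arithmetic:

```python
spike_cpu_seconds = 0.05  # one 0.05 s spike at 100% of a thread
window = 5 * 60           # averaged over a 5-minute window, in seconds

avg_cores = spike_cpu_seconds / window
print(avg_cores * 1e6)    # about 167 microcores

limit_cores = 0.250       # the 250m limit
print(limit_cores / avg_cores)  # roughly 1500x below the limit
```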
Anyway, I believe from all of this we can draw at least one conclusion: the CPU limit acts fast, probably faster than you actually wanted it to. It acts so fast that its actions can be overlooked or misunderstood.
A second conclusion: a CPU limit, especially a restrictive one, might jump in and silently interfere with tiny but peaky workloads, driving up the latency of those workloads in a way you probably did not intend (and might not notice).