Kubernetes CPU limits and latency
We dig in to the effects of Kubernetes CPU limits on response latency.
CPU limits on Kubernetes, the topic that won't die. About a year and a half ago it was a topic here, I found the Google-able material to be sparse, and ended up doing my own research, then blogged about it. At that time, spurious alarms were a motivator. Time has passed, but the topic remains. Latency, in general the feeling of a service just being sluggish, comes up periodically. Can this be coming from CPU limits?
Others have written about this (for example) but I want more detail, to explore the edges, to test different cases.
We'll be using a workload that applies actual load on the CPUs, not just timers. Also we don't want to just fire up single-thread loads in our pods, because those are unrealistic. Or anyway, I think if you have a genuine single-thread-per-pod workload, you might want to think hard about changing it (graphs to support this follow). We'll be using a multi-threaded load. But not just spawning X load threads and counting sadness, we'll need to send requests into the server pods and let them deal with it naturally. Since all the threads/processes in a container are counted for CPU throttling, a CPU limit can be reached quickly where parallelization comes into play, especially when there is a bit of chaos going on (once again, graphs follow to support this claim).
So we need a good workload to test with. Its going to need to generate CPU load, and it doesn't really help us if it happens to do real-world tasks to make that CPU load. In fact, a realistic task might involve additional background components (databases, storage, other services) that can confuse our measurements by introducing unknowns (i.e. external variability). The main difference between a pure computation workload, and a workload that also consumes CPU while waiting for something external, is repeatability. Therefore I went for a pure computation workload, and decided to generate images of the Mandelbrot Set. Some arbitrary choices later, I settled on an image 34000 pixels by 30600 pixels, to be rendered in 26010 tiles of 200x200 pixels each. This works out to be a bit over one gigapixel, a symbolic achievement. Beyond geek points, this workload has a number of very good qualities:
- each tile requires specific calculations, the same every time, so one can expect the elapsed time for that tile to be the same
- the tiles needed to assemble the final image are always the same, meaning the distribution of expected latencies is always the same
- there is a wide range of difficulty for tile generation, from around 0.002 seconds to around 0.5 seconds on a robust 2024-era processor (the exact numbers of course are highly implementation specific, and I deliberately avoided various optimizations to achieve just this distribution of difficulties)
- there is a clear concentration of both difficult and easy tiles, making it easy to identify changes in latency for those cases (a large spike moves in the graphs)
- the actual work kernel requires no input data, should easily fit in the L2 cache on the CPU, and should not be computationally dense enough to cause a power or heat issue, providing some isolation from noisy neighbors
- the result image (assembled from 26010 tiles) always has the same checksum, so correctness can be verified
I implemented a webserver to do the work and a client to assemble the tiles, both in Python 3.12. The work (image generation) kernel inside the webserver is written in C, inlined via the cffi module. This was a low-effort way to access parallelism, and as we will see, it scaled well (though some evidence can be found that it holds things back for large numbers of easy tiles). One additional implementation detail is that I randomize the order of the tiles before requesting them, so that tiles of similar difficulty do not arrive in especially long sequences.
Groundwork, AKA "meet the real world"
So lets go over to the cloud and dig in. We need to avoid simple mistakes that might weaken our findings. First up, not all cloud instances are the same. I tested on numerous AWS c7i.4xlarge instances, and observed significantly differing performance. Some of this could perhaps be due to variations in hardware or location in the datacenter, but most of it, I presume, is due to what else is happening on the same underlying server. The first instance I used showed better latency and throughput for all tile sizes than the second instance. A third instance on which I ran a significant number of tests was somewhere between the first two. In a later testing phase, I was able to document the moment that a performance measurement of an instance changed from one stable level to another stable level, a lasting drop in performance of about 15%. To get usable results despite this variability, I took some precautions:
- did not blend results from different instances on the same graph
- ran every latency-distribution test configuration at least 5 times (5 full 26010-tile images)
- interleaved tests between different configurations so that any temporary performance variation would hopefully not effect only the results for one configuration (i.e. test configurations A, B & C in this order ABCABCABCABCABC)
Here we compare 10 runs with only 4-way request parallelism. The wide gray line is the average of the 20 runs above. We can see from this that a little lower latency can be found by only issuing 4 CPUs of work to the server (AWS c7i.4xlarge), even though it had 8 CPU cores supporting a total of 16 threads. So leaving empty cores results in a bit better latency.
I don't know about you, but I think the consistency of these 10 run results is impressive, just look at how they stick together through the middle. I am quite happy with how the Kubernetes & AWS software stack does its work here.
Another aspect investigated in this graph is the effect of running a single server pod vs running eight server pods. It appears there is no difference, at least with a 4-way request parallelism and no CPU limits. In the previous graph with 8-way request parallelism, there was no observable difference between one or two server pods. From this, we can probably conclude that there is no particular overhead associated with the number of pods.
After many test runs (detailed below), the last thing I did with the second test server was a final 20 standard runs with 8-way request parallelism to compare to the 20 from earlier, just to really check the consistency of that server over time. The latency might be hair lower, but its a small thing. This supports the trustworthiness of the main results.
Getting to the Point
There was even more "real world" going on, but that can wait. We have seen that EC2 instances can differ, but the results on the specific box I have been testing are turning out consistent.
We can now consider our workload. Lets start simple. We'll have the client issue requests with 8 threads, and since this is known to result in less than 8 CPU cores of work for the server pods, we'll deploy 8 server pods limited to 1 CPU core each. 8 x 1 = 8 right? How do you think that will go? Maybe it will be fine. It does, after all, have enough CPU allocated.
It went badly. The wide gray line is once again our 20 reference runs from above. The new lines are 5 runs with 8 client threads, 5 runs with 6 client threads, and 5 runs with 4 client threads. So, all the way down to half the task parallelism as server parallelism, and still this bad latency. The latency tail is long, sometimes 6x what a tile request ought to take, or around 3x for a 4-thread load. To emphasize that, thats half of the load that 8 pods at 1 CPU core each should be able to handle. I hadn't thought it would be quite so bad with only 4 threads, but even at that scale, the 0.5-second tiles were succeeding in generating log jams. Check out the CPU usage reported via Grafana/Prometheus (cycling 4t-6t-8t 5 times).
If you had a look at that alone, you might conclude that 1 CPU core per pod was a generous limit to set. It was so busy being screwed up, it couldn't even use the CPU, but the metrics won't tell you that (aside from the metric that exposes CPU throttling). And why is it so screwed up? Presumably the the difficult tiles, when they arrived, blocked the pod for long enough, often enough, for more difficult tiles to show up on the same pod. Even when there was twice as many server pods as client threads!
That was fun, but perhaps we can just set the CPU limits higher, I wonder how much we need to turn that knob.
Indeed with a limit of 4 CPU cores on each of the 8 pods (twice the CPU than the server actually has), you have almost eliminated the harm. The total time spent rendering tiles is still up by over 10% and there are some latency outliers, but perhaps nobody notices. The configuration with 2 CPU cores per pod (equal in sum to all the CPU on the server) remains kinda crappy, sometimes generating 3x the latency of running with no CPU limit. That wasn't good, but, perhaps this is no worse than a machine that is running out of CPU. Lets throw excessive requests at a set of pods without CPU limits.
16 threads adds some latency but nothing scary (the c7i.4xlarge EC2 instance supports 16 threads). Going up to 24 threads starts to stretch out the worst case, but yeah, it's 50% more parallel tasks than the EC2 instance has hardware to process at once. (Also, the client is running on the same server, using some CPU.) Some of those requests are going to wait to be scheduled, similar to a CPU-throttling situation. The worst-case latency is a bit under 3x the normal 8-thread case, and the shape of the graph looks a bit like the CPU-limits graph above.
How does this compare to similar abuse with CPU limits in place? If all the CPU available on the server is included in the various limits, maybe it will resemble using up all the CPU without limits.
Here we have 25 runs covering 5 configurations. Of these, 10 runs (2 configurations) use CPU limits that in total equal all the CPU on the server (16 "cores", which of course means "logical cores" i.e. threads in this case). The new thing here is to use 4 pods at 4 CPU cores limit, and 2 pods at 8 CPU cores limit, compared to 2, 4 and 8 pods without CPU limits. Short version: CPU limits are worse than just running the EC2 instance out of CPU. More detailed: the CPU limits are less bad when pooled into larger silos, i.e. 2 pods with 8 cores are better than 4 pods with 4 cores. (You already saw 8 pods with 2 cores get wrecked by a mere 8-way parallel workload.) This is because a larger pool of CPU cycles can be used where its needed.
So thats a lot of graphs. Where does this even leave us? Some thoughts:
- the assignment of new tasks to pods which are already busy is a major part of the problem documented above
- fewer & larger pools of CPU perform better than more & smaller pools
- the best and largest pool of CPU is accessed by setting no CPU limits
- CPU limits can result in CPU starvation at levels well under the actual CPU limit (when viewed at monitoring-relevant time scales)
Without the log jams (degenerating into revenge of the real world)
Why stop there? The log jams caused by large tiles being stacked onto already busy pods threaten to undermine the general-case usefulness of these results, as interesting as they may be. A person might wonder what CPU limits do to more uniform workloads.
To explore that, we still need to do CPU work to have something measurable, but we can be more selective of our tasks. I sat down and wrote a new client that would spam subsets of the possible tile requests at the server pods. First I identified some thousands of suitable tiles in the large test image, 18066 that could be expected to take under 1/20th of a second, and another 6507 (partly overlapping) that were not quite that fast, but still should be returned in under 1/10th of a second each if the server pod had time to work on them. I chose 1/10th of a second as the upper limit for the second group because that is the usual CPU-limit binning size. So theoretically even the worst tiles shouldn't quite fully utilize one core in one throttling period.
Because we want to explore all the things, I modified the server script to optionally run in single-threaded mode to contrast that with a 1-core limit (with threads), and with no CPU limit.
Some implementation details: because we are spamming subsets of the possible tiles, we are no longer generating the large result images. The client just randomly selects suitable tiles to request, and verifies that it gets valid image tiles back. Also, because we are not making an actual image, its feasible to ramp up the testing load slowly and to run it exactly as long we like. This shows different aspects of performance than our latency distributions shown above. Another detail: the client in this test is actually a series of subprocesses, up to 8 separate and threaded subprocesses spawned inside a single pod. That spam needs to be intense.
So, lets spam requests at our server pods.
The most obvious thing to take away from these graphs is that single-thread pods get trashed by 1-core-limit pods. In fact, the single-thread pods are not even close to using an entire CPU core regardless of how intense the spam is. Major win for threading.
But there is also something else, something more troubling. Why, I have to ask, does latency start ramping from the very beginning on the single-thread pods? This resembles the log-jam problem, but we just took away the largest requests. How can a group of 8 pods, being fed round-robin, without any large jobs, start to show more latency with only three workers? Being unable to handle more than one thing at once, single-thread pods would immediately show when they get two requests at once, which is indeed happening. This is happening despite the fact there are 8 pods, a mere 3-way request parallelism, and round robin distribution. Or is there?
Here we have a better graph, scaling from 1 to 24 client threads spamming only large tile requests, showing 99th percentile latency. Large tiles will easily consume a CPU core for nearly half a second. Three of these eight server configurations are made of pods which are (individually) unable to process two such requests at once, and those are also the three server configurations that show increased 99th percentile latency as soon as a second client thread joins the test. One of those server configurations that is impacted by the second client thread has 16 pods, the other two have 8. How does one round-robin themselves into a task collision with 16 pods, 2 tasks, and uniform task size? If task distribution was effective, its pretty impossible to see how such a result could be observed.
OK, no there is not round robin distribution. I wrote another little client and added more server features to document which server pods got which requests, and it turns out that using the Kubernetes service endpoints results in each request going to a randomly-selected server pod. So its totally random using the service endpoint.
Well, there is also ingress. You can establish an ingress in front of your server pods and connect to that instead. There are a lot of ways to set this up, we currently have an AWS load balancer in front of nginx pods. A quick read of the nginx ingress docs show round robin is one of the only two options, and the default, and what we use. I then tested this using the load balancer endpoint, and found its still pretty random, but less random, so what is happening? I drew a lovely picture illustrating a feasible explanation.
So first, each request goes to a random load balancer endpoint. (Each has a different IP, and at least in some cases a client will stick to one of them.) But thats not all: each load balancer endpoint sends the request onwards to a random nginx ingress pod. (Which is also configurable, but won't help, as I'll get to.) Then that nginx ingress pod dutifully tries to round-robin it based on its own internal state. The three ingress pods have different round-robin states, though. So if you use this ingress setup, you theoretically get (2/3) * (1/X) odds of hitting the same server pod with two consecutive requests (around 8% of requests with 8 server pods). In this example, there should be about a 0.35% chance of three consecutive requests from random clients going to the same server pod.
There are some settings for the nginx ingress and for the load balancers which can somewhat reduce the randomess (i.e. get closer to round-robin or some more intelligent distribution of requests) but the fundamental problem remains that the ingress pods are maintaining their own separate internal states. If different clients connect, they probably are going to hit different nginx ingress pods, therefore different round-robin states. If you want load-balancing redundant ingress pods, you are basically stuck with them not coordinating what they send to downstream server pods.
It seems this can be worked around with the right ingress infrastructure. An ideal distribution of requests could be expected to reduce the drawbacks of CPU limits. I wonder if that is even good though, you would have a known component that screws with performance in hard-to-quantify ways if not handled just right.
Be that as it may, we haven't quite got done examining CPU limits on the infrastructure that we have. So returning to that: anyone still holding out hope that CPU limits will stop harming performance should direct their attention back to the previous graph, to the 8 pods with 3-core limits, or 16 pods with 2-core limits, both of them having CPU limits in total well over the total CPUs on the server. They both performed somewhat worse than 8 pods without CPU limits. (The server, as before, had 8 cores supporting a total of 16 simultaneous threads.) Even very large limits continue to interfere.
Lets try one more thing just to try and explore all the angles we can. Lets just spam our smallest requests at three different server configurations.
Those graphs are a mess. The smallest tiles, being tiny and therefore chatty, are apparently more sensitive to AWS noise. All three of the server configurations sometimes performed worse than their baseline. To make it more messy, sometimes they show lower latency and lower throughput at the same time.
This type of test at least is very good at showing temporary deviations from baseline performance. The load ramps smoothly and we humans are good at seeing issues with lines. It is therefore possible to pick out the tests which performed consistently from start to finish.
Viewed from Grafana, these 12 test runs look as follows. You can see that none of them reach 1 core of load per pod (on average, in monitoring-relevant intervals), just getting a bit over 3/4 core per pod. Despite the impossibility keeping the CPU packed with work, despite the small work units, the flood of parallelism, the 1-core limited pods did slightly worse than the unlimited ones.
Are we there yet?
This is getting to be a pretty long writeup, but I can hear people saying "the CPU limit is there to protect other workloads". OK sure, we can test that too. Why not.
Lets start this party with the same kind of machine we've been using, c7i.4xlarge. Lets set up a test where there is a steady latency measurement going on while we also throw some noisy workloads at the server. There will be pauses in the noise, and the whole test will go on for 24 hours. We'll alternate between the noise coming from a set of pods without CPU limits, and a set of pods with CPU limits.
In order for all the pods to fit on a single server, we'll need to set CPU requests with care. It will also be important because we are relying on CPU requests to allocate some oxygen to the clients doing the testing. At the same time, we know that finely-divided CPU limit silos will harm throughput for our noise component, so we'll go with sets of 4 pods. Every pod in three separate deployments requests 1 core each, the client pod requests 2 cores, and 2 cores are left over for all the other random daemonsets and things that need to run. When CPU limits are in place, it will be 3 cores per pod, 12 cores total limit, on a machine with 8 real cores and 16 threads.
Here is how the pods all look deployed:
Then, I ran the first test. On the left (image below) we have the background latency, on the right, throughput. I have labeled the noise periods, which started with some 8-core loads, then did some 16-core loads, then some alternating 8- and 16-core loads. In all cases, it also alternated between no limits and 3-core limits. Note that at 8-core loads there is no particular difference, but both raised latency a bit. For 16-core loads, the no-limits pods made larger latency spikes than the CPU-limits pods.
Finally CPU limits had a positive effect! Did they, though? Its true there was less latency, but there was also less CPU cycles being consumed at all. You can see clearly that CPU limits did not prevent latency from creeping into other workloads, nor did absence of CPU limits cause anything too dramatic to happen. You can see that the 16-way CPU-limit tests somehow generated worse latency for the background test (compared to 8-way) while also not using much more CPU (indicating that it also didn't get much more work done).
TLDR: With the 16-way noise load aimed at 4 pods with 3C limits each, we still impacted latency and throughput for everyone while the server had lots of idle cycles. That is not winning.
We're really digging deep here though, so it has to be mentioned once more that c7i.4xlarge is using threads to reach 16 v-cores, which means the second half of the "cores" seen be Kubernetes are worse than the first half. Simply dipping into that 9th CPU "core" is going to yield worse performance on our "16 core" (i.e. 16 thread) box. This calls for another test, with a fresh instance type that gives you one real core for each core that Kubernetes sees. Meet the AMD-powered c7a.4xlarge: 16 cores, 16 threads.
This time, each noise test is the same. It alternates between 4 test configurations: 8-way unlimited, 8-way 3-core limited, 16-way unlimited, 16-way 3-core limited. Always 4 server pods, always 1 core CPU requests. Same as before.
So with the 16-core, 16-thread c7a.4xlarge, some things have changed. Overall its confusing, if you ask me. The 8-way noise load is as good as identical to no load at all, the base latency is lower, base throughput of the background test is quite a bit higher, but the latency tail is longer (i.e. the spikes are higher). I have no good idea why that would be (especially the latency spikes), I might have guessed some numbers would have gone the opposite way of what has been measured.
It can be difficult to pick out the 16-way 3-core limit results because they immediately follow the higher latency/throughput disturbance generated by the 16-way unlimited noise results. They are present, however, and once again the 3-core limits did not protect the background latency measurement from the noise, also when there was plenty of idle CPU around.
Of course, the CPU limits did reduce the interference, by forcing the server to be about 1/3 idle (and no doubt wrecking the latency distribution of the noise workload, scroll back several pages to see how that sort of thing looks). This idling of CPU and harming of "noise" workloads happens while still not stopping the interference with our test workload.
CPU limits are a tool that simply does not impress. Yes, we could run yet more tests, but I feel like we'd be trying pretty hard to find that case where CPU limits met their purpose. If they have a good use case, its a bit hard to find. I need to move on with real work.
The End.
So, there we have it. I have run hundreds of test runs with various angles on the question, invested over a week of work time, and have found arguably nowhere that CPU limits avoided a negative effect, with the exception of one case where CPU limits made no difference: a single pod with a CPU limit high enough to absorb all the work.
The problem (for most of these tests) is fundamentally that work does not arrive evenly, and siloing your resources into a lot of little CPU-limited pods is creating opportunities for them to be individually overwhelmed. There is always some CPU limit of course: all the CPU on the server. One option is to let that be your limit. Everything. In the case of insufficient CPU cycles, they are distributed according to the relative sizes of the CPU requests, which seems reasonable to me.
There is also the issue that not all performance can be nicely quantified and controlled by counting CPU cycles. There are caches, busses and clock boost budgets that escape attempts to put order in the world. Noisy neighbors, in your server or next to it, will be heard.
I can't see that CPU limits have much of a useful role to play.