Why you should keep using CPU limits on Kubernetes
Or why staying away from unused CPU may be good for your containers
I wrote this article as a counterpoint to “For the love of god, stop using CPU limits on Kubernetes.”
I like that article and consider it a great read. I agree with its recommendations about setting container memory requests and limits and the recommendation always to set CPU requests.
My disagreement, evident in the contrasting title, is with the extent of its final recommendation about not setting CPU limits.
As a lightning-quick recap* of Kubernetes resource management for containers: a CPU “request” is the minimum amount of CPU available to a container while it is running, and a CPU “limit” is the maximum amount of CPU the container can use.
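To make the recap concrete, here is a minimal Python sketch of how a request and a limit are commonly described as translating into cgroup v1 settings on a Linux node: the request becomes a relative weight (`cpu.shares`) and the limit becomes a hard quota per scheduling period (`cpu.cfs_quota_us`). The function names are my own, and the constants assume the default 100 ms CFS period. Nodes running cgroup v2 express the same ideas through different files (`cpu.weight` and `cpu.max`).

```python
CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms
MIN_SHARES = 2           # smallest weight the kernel accepts
MIN_QUOTA_US = 1_000     # smallest quota: 1 ms per period


def cpu_shares(request_millicores: int) -> int:
    """Relative CPU weight (cgroup v1 cpu.shares) derived from a CPU request."""
    if request_millicores == 0:
        return MIN_SHARES
    return max(MIN_SHARES, request_millicores * 1024 // 1000)


def cfs_quota_us(limit_millicores, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time allowed per period (cpu.cfs_quota_us); -1 means no limit is set."""
    if limit_millicores is None:
        return -1
    return max(MIN_QUOTA_US, limit_millicores * period_us // 1000)


if __name__ == "__main__":
    print(cpu_shares(200), cfs_quota_us(200), cfs_quota_us(None))
```

Under these assumptions, a container with a 200m request and a 200m limit gets a weight of 204 and may run for 20 ms out of every 100 ms period. Without a limit, only the weight applies.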
The idea of banning CPU limits is rooted in the concept that setting a CPU limit has the sole negative effect of preventing the container from accessing the unused (and unreserved) CPU in the node, subjecting the container to unnecessary and potentially harmful CPU throttling.
However, after extensive research while writing “Infinite scaling with containers and Kubernetes,” I think there are good reasons to set CPU limits most of the time, a contrarian opinion explained in the following sections.
Update on 9/30: I retracted an entire section titled “Reason #6: Consider namespace resource quotas.” I had missed the requirement for all pods added to the namespace to have CPU limits in the first place, thus making the whole point moot for a debate on using versus not using CPU limits.
* For those who want to go deeper into how Kubernetes resources work, I recommend Layer-by-Layer Cgroup in Kubernetes from Stefanie Lai and this data-filled 3-part series by Shon Lev-Ran and Shir Monether.
Reason #1: Resilience under throttling
Updated on 9/30: The original title of this section was “Container probes are not compressible,” which obscured the main point of this section: developing containers with CPU limits in place makes them more resilient and better behaved under CPU throttling events. Creating liveness container probes that are more resilient to throttling events and readiness probes that are more accurate under throttling are only two of the many positive outcomes of developing containers with CPU limits.
When a container attempts to use more CPU than what is available — and ignoring preliminary scheduling-level mitigations like shuffling pods around nodes — the node “throttles” the CPU utilization for the container, effectively making it run slower. That is why Kubernetes documentation designates CPUs as “compressible” resources.
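As a rough illustration of what "running slower" means when a CPU limit is enforced, the sketch below estimates the wall-clock time of a single-threaded burst under a per-period quota. It is a simplification under stated assumptions: the burst starts at the beginning of a period, nothing else in the container consumes quota, and the period is the default 100 ms. The function name is hypothetical.

```python
import math


def wall_clock_ms(cpu_work_ms: float, limit_millicores: int,
                  period_ms: float = 100.0) -> float:
    """Estimated wall-clock time for a single thread needing cpu_work_ms of CPU."""
    quota_ms = period_ms * limit_millicores / 1000  # CPU time allowed per period
    # One thread cannot use more than a full period, so it is never throttled
    # when the quota covers the whole period or the work fits in one quota.
    if quota_ms >= period_ms or cpu_work_ms <= quota_ms:
        return cpu_work_ms
    # Every period but the last is spent running quota_ms, then waiting.
    full_periods = math.ceil(cpu_work_ms / quota_ms) - 1
    remaining_ms = cpu_work_ms - full_periods * quota_ms
    return full_periods * period_ms + remaining_ms


if __name__ == "__main__":
    print(wall_clock_ms(300, 200))
```

With a 200m limit, 300 ms of CPU work stretches to roughly 1,420 ms of wall-clock time in this model, which is the kind of slowdown a probe handler may have to live with.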
Therefore, there is a pervasive assumption that containers can safely undergo throttling events. However, that assumption stumbles upon an unintended consequence related to Kubernetes probes.
When the node throttles a container (strictly speaking, the Linux kernel scheduler enforces the throttling through the cgroup settings that the kubelet and container runtime configure), the throttling event potentially affects all threads in the container, including the ones processing liveness and readiness probes.
Even if the container has set a CPU request, the bursting workload inside the container may consume a disproportionate share of that allotment, slowing down the probes’ responses.
Successive failures or timeouts of a readiness probe cause the kubelet to divert traffic from the container, while successive failures or timeouts of a liveness probe cause the kubelet to terminate the container.
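The "successive failures" rule can be sketched as a small simulation. The defaults below mirror the Kubernetes probe defaults (`timeoutSeconds: 1`, `failureThreshold: 3`), while the function name and the latency samples are made up for illustration.

```python
def first_action_index(latencies_s, timeout_s: float = 1.0,
                       failure_threshold: int = 3):
    """Index of the probe attempt at which the kubelet would act, or None.

    A probe attempt fails when its response takes longer than timeout_s;
    any successful attempt resets the count of consecutive failures.
    """
    consecutive_failures = 0
    for index, latency in enumerate(latencies_s):
        if latency > timeout_s:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
        else:
            consecutive_failures = 0
    return None


if __name__ == "__main__":
    # Hypothetical probe latencies (seconds) from a container under throttling.
    print(first_action_index([0.2, 1.5, 0.3, 1.2, 1.4, 2.0]))
```

In this made-up series, an isolated slow response is forgiven, but three slow responses in a row trigger the action on the sixth attempt (index 5): a traffic cutoff for a readiness probe, a restart for a liveness probe.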
I wrote extensively about good practices in designing container probes, which maximize the odds of probes responding to the kubelet while the container is under stress, but those odds decrease in direct relation to the container’s excess utilization above its CPU request.
[If throttling is bad for probes, setting CPU limits makes it even worse because it increases the number of throttling events.]
The idea is not that running containers with CPU limits in production makes them more resilient to throttling events. Of course, it doesn’t. The key aspect here is to develop containers with CPU limits from the outset, during the entire development phase.
The continued observation of undesirable behaviors under throttling — assuming a systematic approach to running the container under heavy load in resource-constrained nodes — informs important design and implementation decisions.
For instance, those observations can lead to improvements such as:
- Leveraging more of the container’s internal state in the container’s readiness probe, thus helping the kubelet to divert traffic to other replicas that are less overworked;
- Revisiting the design and tuning of internal thread pools;
- Surfacing additional tuning parameters to system administrators.
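As one possible shape for the first improvement in the list, the sketch below decides a readiness response from internal state rather than from a bare "the process is up" check. Everything here is hypothetical: the `WorkloadState` fields, the thresholds, and the idea that the container tracks queue depth and worker occupancy in the first place.

```python
from dataclasses import dataclass


@dataclass
class WorkloadState:
    """Hypothetical internal counters a container might already maintain."""
    queued_requests: int
    busy_workers: int
    total_workers: int


def readiness_status(state: WorkloadState, max_queue: int = 50,
                     max_busy_ratio: float = 0.9) -> int:
    """HTTP status for a readiness endpoint: 200 when ready, 503 when overworked."""
    if state.total_workers == 0:
        return 503  # no workers at all: cannot take traffic
    busy_ratio = state.busy_workers / state.total_workers
    if state.queued_requests > max_queue or busy_ratio > max_busy_ratio:
        return 503
    return 200


if __name__ == "__main__":
    print(readiness_status(WorkloadState(5, 4, 10)))
    print(readiness_status(WorkloadState(80, 4, 10)))
```

The thresholds are exactly the sort of tuning parameters that observing the container under throttling would help calibrate.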
On the flip side, if you develop containers without CPU limits, their behavior is at the mercy of whatever spare CPU cycles are available on the nodes where they are running. A container that works consistently well in a development cluster with low utilization can behave very differently (and poorly) when it cannot access the same levels of unreserved CPU.
Net: Develop and benchmark containers with CPU limits turned on, striving to make the container work consistently under throttling events. When the node throttles a workload, it has no regard for the consequences to that workload.
Reason #2: Losing the “Guaranteed” status
Updated on 9/28: This section was titled “Higher penalties for higher bursts” and incorrectly stated that exceeding CPU requests was a factor in the node-pressure eviction algorithm. That is only true of memory and disk resources.
This one is a tangential reason but worth noting.
The settings of resource requests and limits for each container in the pod determine the pod’s Quality of Service:
- Not setting any requests or limits classifies the pod as “BestEffort,” a QoS ruled out in this story by our initial assumption that all containers have memory and CPU requests.
- Setting CPU request but not setting a CPU limit on a container classifies the pod as “Burstable.”
- The only way to achieve a “Guaranteed” QoS for a pod is to set identical request and limit values for CPU and memory for each container of the pod.
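The three rules above can be sketched as a small classifier. This is a simplified model, not the Kubernetes implementation: the function name and input shape are my own, and quantities are compared as plain strings, so "500m" and "0.5" would wrongly count as different values.

```python
def qos_class(containers) -> str:
    """Classify a pod from its containers' 'requests' and 'limits' dicts."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A limit set without a request makes the request default to the limit.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                anything_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    no_cpu_limit = [{"requests": {"cpu": "200m", "memory": "256Mi"},
                     "limits": {"memory": "256Mi"}}]
    print(qos_class(no_cpu_limit))
```

The `no_cpu_limit` pod follows the "requests everywhere, memory limits, no CPU limits" advice and lands in "Burstable," even though its memory settings alone would qualify it for "Guaranteed."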
The most notable use case for the QoS is “node-pressure eviction,” where the kubelet assesses which pods need to be moved out of the node to ease resource utilization.
Guaranteed pods are in the last group of pods to be considered for eviction — assuming equal pod priorities — whereas burstable pods tend to go first.
Note that the eviction assessment only considers memory, disk utilization, and filesystem inodes. In that sense, it is technically possible that a burstable pod may exceed its CPU request and still rank alongside guaranteed pods when eviction time comes.
A less common (or maybe less popular) use case is CPU management policies, where the benefits of the “static” policy, namely exclusive CPU assignment, only apply to containers in guaranteed pods that use integer CPU requests. The Kubernetes documentation on the CPU Manager explains how that policy benefits those specific workloads.
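The eligibility rule for the static policy can be written down in a couple of lines. The function name is hypothetical, and the check only captures the two conditions mentioned above (a guaranteed pod and a whole number of CPUs), not the rest of the CPU Manager's behavior.

```python
def eligible_for_exclusive_cpus(pod_qos: str, cpu_request_millicores: int) -> bool:
    """True when a container could be given exclusive CPUs by the static policy."""
    whole_cpus = (cpu_request_millicores >= 1000
                  and cpu_request_millicores % 1000 == 0)
    return pod_qos == "Guaranteed" and whole_cpus


if __name__ == "__main__":
    print(eligible_for_exclusive_cpus("Guaranteed", 2000))
    print(eligible_for_exclusive_cpus("Burstable", 2000))
```

A pod without CPU limits is "Burstable" at best, so it fails the first condition regardless of how its request is sized.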
Net: Once again, strive to manage CPU usage within the container. Not setting CPU limits downgrades the containing pod to a “Burstable” QoS and makes it ineligible for favorable treatment in some scheduling and throttling scenarios.
Reason #3: CPU and memory usage are correlated
For better or worse, CPU and memory are inextricably correlated in computing. That is why no cloud provider offers VMs with a lopsided ratio of CPUs to memory, such as a 16x2 node (16 CPUs and 2 gigabytes of RAM). The most common ratios of CPU to memory in computing instances go in the opposite direction, with the number of CPUs typically a fourth or half of the memory (in gigabytes), such as 8x16, 16x64, and 32x64.
There are many practical reasons for that kind of correlation, but in general:
- Reading and writing lots of memory takes lots of CPUs.
- Using lots of CPUs means using lots of threads (with accompanying thread stacks in memory) and producing lots of data, which in turn need to go in memory (even if cached on their way to stable storage).
One can theorize scenarios where many CPUs mainline their output into a disk (without caching the data in memory?). Still, if you have done your homework during development, it is hard to imagine how a container requesting 200m of CPU (200 millicores) and 256Mi of RAM can effectively benefit from using 2 CPUs (10x over the requested CPU amount) without needing to use more memory (even if not 10x more).
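A quick back-of-the-envelope calculation shows how lopsided that burst becomes. Reading the example as a request of 200 millicores and 256 MiB, the sketch below computes the memory available per CPU at the requested level and at the 2-CPU burst. The function name is my own.

```python
def memory_per_cpu_mib(memory_mib: float, cpu_millicores: int) -> float:
    """Memory available per full CPU, in MiB."""
    return memory_mib * 1000 / cpu_millicores


if __name__ == "__main__":
    at_request = memory_per_cpu_mib(256, 200)   # sized as designed
    at_burst = memory_per_cpu_mib(256, 2000)    # bursting to 2 CPUs, memory capped
    print(at_request, at_burst)
```

The container goes from 1,280 MiB per CPU to 128 MiB per CPU, a ratio far below the instance shapes mentioned earlier, where a CPU is typically paired with 2 to 4 gigabytes of memory.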
Net: Leaving the door open for CPU utilization to run unchecked (up to the CPU limit of the entire node) without allowing corresponding increases in memory usage does not meet the usage pattern for most applications. From a system perspective, you are better off following the suggestion in the next section.
Reason #4: Pod autoscalers can do a better job
One of the arguments for letting containers freely exceed their CPU requests is that if some of the node CPU capacity is not in use, one might as well allow pods to use the extra capacity.
I agree with the principle, but as mentioned in the previous section, that extra CPU often requires more memory to support whatever extra work it is doing.
Pod autoscalers excel at the job of extending pod capacity in a predictable and balanced manner. I wrote about them in more detail in “Infinite scaling with containers and Kubernetes,” but here is a short recap:
- HPA (Horizontal Pod Autoscaler) and KPA (Knative Pod Autoscaler) can increase the number of pod replicas, thus adding more aggregate CPU AND memory capacity to the entire workload without the side-effect of a runaway thread squeezing container probes out of their operational range.
- VPA (Vertical Pod Autoscaler) can increase the overall limits for CPU AND memory, ensuring the pod experiences a balanced adjustment of its resource requests and limits.
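For the horizontal case, the scaling decision follows the formula in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of observed to target metric value, rounded up, with no change when the ratio is within a tolerance (0.1 by default). The sketch below applies it to CPU utilization measured against the CPU request. The function and parameter names are my own.

```python
import math


def desired_replicas(current_replicas: int, avg_usage_millicores: float,
                     request_millicores: int, target_utilization_pct: float,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA formula for a CPU utilization target."""
    # Utilization is expressed as a percentage of the CPU request.
    utilization_pct = 100 * avg_usage_millicores / request_millicores
    ratio = utilization_pct / target_utilization_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: leave as-is
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Four replicas averaging 180m against a 200m request, targeting 60%.
    print(desired_replicas(4, 180, 200, 60))
```

In that example, four replicas running at 90% of their request grow to six, and each new replica brings its own memory allotment along with the extra CPU.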
[ But pod autoscalers leverage resource usage versus requested values in their decisions, not limits. ]
That is correct. Also, beside the point.
The idea here is to discourage the practice of relying on bursting CPU utilization as a mechanism to get more work done in favor of components specifically designed for the purpose.
Net: Pod autoscalers are designed to maximize balanced resource utilization. While letting a container exceed its CPU reservation may occasionally help with a few scenarios, pod autoscaling does better across most scenarios in terms of increasing resource limits in a purposeful manner.
Reason #5: Hyper-scalers require container limits
Updated on 9/30 — The original title of this section was “Cluster autoscalers don’t like sudden movements” and (incorrectly) stated that CPU utilization (above container CPU requests) could trigger autoscaling events. Reader Shir Monether, in the comments section, correctly pointed out that this is not the case, so I rewrote the section to leave only the portion about managed container services.
This is a tangential reason, though I thought I would still list it.
If your workloads ever exceed the capacity of a single cluster and you decide to use a managed container service to run your containers, such as Code Engine or ECS, those services demand that you set CPU (and memory) limits.
That requirement helps the providers with resource allocation and works in your favor, too, since you pay by resource allocation, and limitless usage means limitless cost.
Conclusion
The advantages of accessing unused CPU capacity in a cluster node erode in direct proportion to how much you rely on that spare capacity.
Developing containers without CPU limits often leads to invisible dependencies on spare CPU cycles that may not be available across different environments, with consequences such as container probes getting squeezed out of vital CPU cycles, (slightly) increased odds of pod eviction in case of resource shortages, and unbalanced allocation of (unlimited) CPU versus (limited) memory.
If the goal of removing CPU limits is to maximize the allocation and utilization of node and cluster resources, dedicated components designed with those goals in mind, such as pod and cluster autoscalers, can often do a better job by also increasing memory allocation to match the CPU increases.
Designing containers for balanced resource utilization, with all container resource limits set closely (or identically) to resource requests throughout development, is a sound practice, especially when paired with effective probe design.
Well-designed container limits and probes let the kubelet know when to reroute traffic to other pod replicas across the cluster, while deferring to pod autoscalers to determine when and how to adjust the aggregate capacity of a pod workload.