Most customers I come across who already run containerized applications are dealing with the complexities of managing the multiple resources their environments need. I find there are two types of challenges:
- It is difficult to manage the configuration options within K8s at scale (utilization, quotas, requests, limits).
- It is humanly impossible to understand the impact of K8s changes on the underlying infrastructure (e.g. if I size down my request, what is the impact on the physical host serving the pod?).
In this post, I will explore each one of these challenges.
Managing K8s Tradeoffs at Scale
The four most common limited-capacity factors in managing the environment are node utilization, requests, limits, and quotas.
Container pods consume memory and CPU from their underlying nodes. High utilization is one of the most common causes of performance issues, yet it is often overlooked when managing capacity, since one hopes that best practices around limits, quotas, and requests will drive the right utilization.
The goal is to reach continuously effective CPU / memory utilization on your nodes: taking advantage of the underlying resources while maintaining headroom for bursts in demand. A target I often hear is around 70% utilization. Ultimately, this comes down to the level of risk you are willing to take, and it might also differ between applications (how hot do you want to drive your nodes?).
Turbonomic best practices for safely maximizing node resource utilization:
Continuous placement of pods on nodes enables the best usage of the underlying infrastructure. Since not all containers need all of their resources at the exact same time, managing the resources dynamically ensures that when a pod needs resources, they are available.
Requests are guaranteed capacity for a container. Requests have nothing to do with the resources actually utilized: when a new pod is scheduled, K8s checks the available request capacity and places the pod accordingly, regardless of actual utilization.
In other words, if a Pod is scheduled and no node has enough request capacity – it will not be placed.
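As an illustration (names and values here are hypothetical), a pod spec declares its requests like this; the scheduler only compares these numbers against each node's unreserved request capacity, not against live utilization:

```yaml
# Hypothetical pod spec: the scheduler reserves 250m CPU / 256Mi memory
# against a node's allocatable capacity, regardless of what the
# container actually uses at runtime.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app                  # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:1.0  # hypothetical image
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
```

If no node has 250m CPU and 256Mi of request capacity left, this pod stays Pending.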
Turbonomic best practices for managing requests:
Right-sizing requests for all individual containers allows more pods to be placed into the environment safely. Continuous pod placement helps prevent a single node from running out of request capacity.
Limits determine the maximum amount of CPU & memory a container can use at runtime. Nodes can be overcommitted on limits (as opposed to requests), i.e. the sum of the limits can be higher than the node's resources.
Typically, you do want to overcommit these resources; otherwise, you introduce significant waste into the environment. Nevertheless, overcommitment by nature always carries the risk of running out of resources, which should be managed carefully.
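A minimal sketch of a container's resources with limits added (values hypothetical). The sum of limits across pods may exceed the node's physical resources, which is the overcommitment described above:

```yaml
# Hypothetical container resources: requests are guaranteed at
# scheduling time; limits cap runtime usage. Exceeding the memory
# limit gets the container OOM-killed; exceeding the CPU limit
# gets it throttled.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"       # bursts up to 2x the request
    memory: "512Mi"
```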
Turbonomic best practices for managing limits:
Continuous placement of pods on nodes minimizes the risk of OOM errors caused by actual memory usage exceeding the allocatable memory on the node.
Right-sizing limits to actual demand allows a tighter control band, giving predictable performance and safer use of the burstable space on the node.
Quotas are a mechanism for the K8s admin to cap an individual team's usage (of either requests or limits), so it won't consume more than its fair share of resources. Once a project reaches its quota, no additional pods are accepted for that project. This is completely separate from the actual utilization on the cluster, or the available capacity of the cluster as a whole.
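For illustration, a hedged sketch of a namespace-scoped ResourceQuota (names and values hypothetical). Once the summed requests or limits of pods in the namespace reach these numbers, new pods are rejected even if the cluster itself has spare capacity:

```yaml
# Hypothetical quota for one team's namespace / project.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # hypothetical name
  namespace: team-a         # hypothetical namespace
spec:
  hard:
    requests.cpu: "10"      # total CPU requests allowed in the namespace
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
```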
Turbonomic best practices for managing quotas:
Continuous right-sizing of requests and limits helps keep a project from running out of quota. There is no need to charge your project's quota for resources it doesn't use anyway.
Understanding the Complete Supply Chain
Obviously, the utilization of the underlying infrastructure is a major factor in capacity planning, especially if you are running on premises, and even more so if there is a virtualization layer in the stack (a hypervisor such as VMware). At the end of the day, if your underlying infrastructure is highly utilized, how can you increase your density?
In the words of Scotty: you can't change the laws of physics. A high CPU Ready Queue creates latency that can cause failures in a container four layers above it.
To explain the complexity, think about this scenario:
1. CPU wait time (Ready Queue) builds up on a physical ESX host.
2. Different VMs on the host start suffering from latency.
3. One of the heavily impacted VMs is a K8s node.
4. Containers on that node start crashing and can't be rescheduled in time.
Turbonomic best practices for managing the complete supply chain:
Capacity & performance must be managed holistically up and down the entire physical / virtual stack.
But there's one more thing…
Even on its own, continuous sizing and placement of pods on nodes is an NP-complete problem. Furthermore, there are multiple constraints these placements HAVE to follow, such as keeping master nodes on separate ESX hosts on premises, or geolocation policies at scale in the cloud.
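As one hedged example of such a constraint (labels are hypothetical), a pod anti-affinity rule can keep control-plane replicas on separate hosts, and a topology spread constraint can enforce distribution across zones:

```yaml
# Hypothetical snippet from a pod spec: anti-affinity keeps replicas
# labeled app=control-plane off hosts that already run one, and the
# spread constraint balances them across availability zones.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: control-plane
        topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: control-plane
```

Every such hard constraint shrinks the scheduler's solution space, which is why placement has to be re-evaluated continuously rather than decided once.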
Microservices applications introduce significant complexity to performance and capacity management. To operate at scale, a system should be able to continuously scale, place, and configure your estate, within the context of multiple resources up and down the supply chain.