How to Effectively Control Requests, the Silent Killer of Elasticity & Efficiency
Now that Kubernetes has graduated to enterprise-ready, organizations are looking to expand their platforms to support more applications and more lines of business. In order to do this at scale, they need to support these applications and lines of business in multitenant environments—even modern microservice applications have to share resources!
This is especially true for one of our customers in the healthcare industry. They took the plunge into Kubernetes (specifically Red Hat OpenShift) two years ago and are now “platform-first,” meaning that all new applications are architected for cloud native speed, agility, and elasticity; and existing applications are gradually being rearchitected for it. These are business-critical applications that, for example, leverage massive amounts of data and machine learning to help clinicians more accurately and effectively diagnose and treat their patients.
As a platform-first organization, this customer has to manage multiple applications on Kubernetes and is leveraging multitenancy. Kubernetes allows administrators to manage shared CPU and Memory through Requests and Limits. Limits can be overcommitted, but Requests are guaranteed resources. In other words, Kubernetes will only schedule a pod onto a node when the sum of the Requests of all containers in the pod fits within that node's allocatable capacity.
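As a concrete illustration (the pod name and image below are hypothetical), a container declares its Requests and Limits in the pod spec, and the scheduler places the pod only on a node whose free allocatable capacity covers the sum of the Requests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: reports-api                 # hypothetical pod name
spec:
  containers:
  - name: reports-api
    image: registry.example.com/reports-api:1.4   # hypothetical image
    resources:
      requests:                     # guaranteed; what the scheduler counts
        cpu: "500m"                 # half a CPU core
        memory: "512Mi"
      limits:                       # ceiling; limits may be overcommitted
        cpu: "1"
        memory: "1Gi"
```

Here the pod is guaranteed 500m CPU and 512Mi of memory, while the node may hand out limits that collectively exceed its physical capacity.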
Enterprises that onboard applications to Kubernetes typically adopt certain best practices for managing container resource Requests and Limits. For most organizations, including this customer, each line of business (LOB) onboards to the platform starting with an assigned namespace (or “project” in OpenShift) and an applied resource quota. This quota fences in the amount of resources available to pods deployed within the namespace, while the requests counted against the quota guarantee that one LOB will not overrun another’s share of cluster resources. When using quotas, a user who submits an application to Kubernetes is required to provide a resource request for the following reasons:
- To maintain a certain level of quality of service (QoS). This is especially critical for CPU-intensive applications, where the requested CPU translates into cgroup CPU shares that the runtime enforces when the node becomes congested.
- To guarantee a minimum amount of resources. Memory is incompressible and is not swapped in Kubernetes, so many Java services use requests to assure a minimum amount is available to the JVM and reduce the risk of OOM kills.
- To make it easier to manage and plan resources in a multi-tenant environment. This is achieved together with resource quotas defined in namespaces, giving each app or LOB their own resources.
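The namespace fencing described above is done with a ResourceQuota object. The sketch below (the name, namespace, and values are hypothetical) caps the total CPU and memory that pods in the namespace may request:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: lob-quota                   # hypothetical quota name
  namespace: claims-ml              # hypothetical namespace / OpenShift project
spec:
  hard:
    requests.cpu: "20"              # total CPU requests across all pods
    requests.memory: 64Gi           # total memory requests across all pods
    limits.cpu: "40"
    limits.memory: 128Gi
```

Once a quota restricts requests, the API server rejects pods in that namespace that do not declare them (unless a LimitRange supplies defaults), which is why users are required to provide requests in the first place.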
The Challenge: Today’s Multitenancy Best Practices are Still Complex to Manage and Limit Elasticity
The best practices around managing resource requests provide some manageability, but at scale our customers continued to face challenges. Because resource requests specified for a service are estimates, it is very easy for a cluster to get into the following situations:
- Overestimation of resources. It’s very common to see estimated resource usage that is significantly higher than actual usage. Understandably, developers tend to be overly conservative about their resource needs: they request resources with big buffers so that their service will not suffer.
- Underestimation of resources. Here actual resource usage is higher than the requested amount, which can result in noisy-neighbor problems or, even worse, overcommitted nodes, where pods may be evicted due to resource pressure (for incompressible resources like memory). This resource contention results in poor application performance and poor experiences for customers.
The crux of the issue is that developers only see, and are responsible for, their own service. A multitenant cluster hosts many services, each with its own specification that does not account for the compound effect of running together. Even if 100 services are each only 10% over-allocated, the aggregate requests on a cluster or project will make it appear "full" when it is not. The consequence is that you cannot run more services, you cannot resize up the limits of those that need it, and you waste money running more nodes than needed.
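The compounding effect described above can be sketched with some toy arithmetic (the numbers are illustrative, not from the customer):

```python
# 100 services, each actually using 900m CPU but requesting 10% more.
actual_per_service_m = 900                                  # millicores actually used
requested_per_service_m = int(actual_per_service_m * 1.10)  # 990m requested

services = 100
total_requested_m = services * requested_per_service_m      # 99,000m requested
total_actual_m = services * actual_per_service_m            # 90,000m really used

# A cluster with 99 cores of allocatable capacity looks "full" to the
# scheduler, because scheduling is driven by requests, not usage ...
cluster_capacity_m = 99_000
free_by_requests_m = cluster_capacity_m - total_requested_m  # nothing left to schedule
# ... while 9 full cores of real capacity sit idle.
idle_m = cluster_capacity_m - total_actual_m

print(free_by_requests_m, idle_m)
```

In this toy cluster no new pod can be scheduled even though nine cores are doing nothing, which is exactly the waste the paragraph above describes.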
It’s a very difficult challenge, and one that both DevOps and SRE teams worry about. It’s hard to see the impact of over-allocation, and where to optimize, when searching across dozens or even hundreds of services running in different projects. DevOps and SREs both need to understand whether:
- Services are over-allocated
- The project/namespace is under-allocated
- The cluster is undersized
Requests are the silent killer of elasticity and efficiency.
How Turbonomic Helps You Effectively Manage Kubernetes Multitenancy
In Turbonomic 6.4, we introduced the discovery of resource requests in Kubernetes clusters. This allows customers to:
- Visualize the quantity of resource Requests at all levels of the stack from the pod to the node to the namespace and cluster.
- Visualize the historical trend of resource Requests at all levels.
In the UI screenshots below, Turbonomic shows Top Virtual Machines (nodes), in this case sorted by CPU requests (Figure 1). You can also view actual CPU usage compared to CPU requests over a period of time, in this case the last 2 hours (Figure 2).
Figure 1. Top Virtual Machines Sorted by CPU requests.
Figure 2. Actual CPU usage vs. CPU request on a node over a period of 2 hours.
But a cluster-level view is not enough. Since resources are allocated by namespace, you also need a view into the guaranteed and actual utilization of each namespace. This gives each LOB better self-service visibility into how effectively they are using what is reserved, and it allows the platform team to spot which projects are running out of resources and which are not (see Figure 3 for a single-namespace example).
For projects that are filling up, you can assess whether they are full based on actual usage or whether more has been requested than necessary, which may be an opportunity to optimize the workloads there. The LOB or app team can start from their namespace utilization and drill down to the specific pods in their namespace.
Figure 3. Actual CPU/Mem usage vs. CPU/Mem request for a namespace, and the user can see resize actions for the pods deployed there.
Most importantly, Turbonomic identifies opportunities to resize down requests and consolidate resources (“move” pods) on the nodes to accommodate more pods. Note the “Actions” buttons on the right in Figure 1. At a cluster level, Turbonomic will also determine how much capacity is available to onboard new services or projects. With the visualization of Requests vs. actual usage (Figures 1 & 2), customers can quickly see discrepancies between what is allocated and what is used. But when you’re operating at scale, monitoring and dashboards are not enough.
Turbonomic Automatically Identifies Opportunities to Optimize Your Clusters for Performance and Efficiency—and You Can Automate It!
Automatically assuring continuous performance while maximizing efficiency is what Turbonomic is all about. This is what we’ve been doing since the virtualization days, when applications first started running in shared environments.
When it comes to Kubernetes, Turbonomic continuously analyzes the complete stack to ensure that applications’ real-time resource needs are exactly met by the underlying supply of resources. It generates specific actions that can and should be automated and/or integrated into your organization’s existing workflows, namely:
- Move actions that move pods from a node with full request utilization to a node with lower request utilization.
Figure 4: Turbonomic determines that two container pods currently experiencing congestion on the node should be moved to a node that has capacity for them. Note that Turbo identifies which pod to move where, and when, before pods crash – a gap in the k8s orchestrator!
- Optimize continuous placement decisions so that request utilization will not exceed allocatable capacity (aka the eviction zone!).
- Generate resize-down actions for a resource request when it is overestimated compared to actual usage.
- Generate node provision actions when all existing nodes are congested on resource requests.
Figure 5: Turbonomic determines that the cluster requires another node to support the applications it is running.
With Turbonomic Multitenancy Does Not Mean You Have to Limit Elasticity
Limits and Requests are intended to make managing multitenancy easier. And they do that to a certain extent. But as I’ve discussed, there are challenges that arise because at the end of the day people are making allocation decisions. They have to estimate and add buffers because they can’t analyze dynamic demand 24/7. It puts application performance at risk, it can increase costs unnecessarily, and it limits elasticity—which is why you made that investment in Kubernetes in the first place. With Turbonomic software continuously and automatically making resource decisions for you, you can achieve the bottom-line benefits of multitenancy without limiting application elasticity.