Cloud application optimization is beyond human scale.
When we talk with our enterprise customers, we hear a common theme. Their number one challenge is how to optimize cloud-based application performance at the lowest possible cost. Regardless of whether they are new to public cloud or so-called "Cloud Veterans," almost every organization admits that managing their applications in the cloud is far more complex than anticipated. Performance is paramount, so the least risky thing you can do is overprovision. But without an understanding of an application’s true resource needs, customers must use utilization as the proxy for demand.
This supply-only approach sees changes in resource utilization reactively but remains blind to the changes in demand that cause them. To mitigate risk, staff guess at how much of each resource they need to allocate, and they rarely guess right. This commonly used allocation-driven approach results in excessively overprovisioned resources with no assurance of delivering a positive end-user experience.
The Turbonomic consumption-driven approach is unique, continuously self-actioning, and extensible, allowing customers to jettison their old, manual allocation model guesswork for simpler, dynamic and trustworthy resourcing across private, public (and eventually edge) clouds.
Common Cloud Application Initiatives
We see a number of approaches to managing application resources in the cloud, but here are the three most common.
- Manual Optimization Initiatives
This is the most common approach. These initiatives are done on a cadence that varies between companies. They typically involve a joint effort between IT Ops, cloud application owners, and finance (or FinOps in some mature customers). These initiatives require the most brilliant people in the organization to work together, reviewing an extraordinary amount of data to decide where and how they can reduce cost. The results mainly take the form of cleaning up forgotten resources (low-hanging fruit). However, these efforts are time-consuming, even for small environments, and they lose sight of the fact that application demand is dynamic and fluctuating, so application resources must be managed continuously if you’re really going for operational excellence.
- Cost Visibility and Reporting
This includes deep billing analysis, chargebacks, showbacks, and other forms of reporting and drill-downs. This approach is excellent at showing how big the problem is, and in some cases, helps with identifying billing anomalies. But similar to the cleanup-on-a-cadence approach, this initiative fails because there is no understanding of application performance and the exact resources required to assure their performance without the cost overruns. And it doesn't address the behaviors which led to the higher costs in the first place (more on that later).
- Cost Optimization Tools
This category includes two types of tools. The first is native tools offered by the cloud vendor, such as Azure Advisor from Microsoft and a plethora of tools from AWS, including AWS Compute Optimizer, AWS Trusted Advisor, and others. The second is third-party tools from ISVs that "specialize" in cloud cost reduction. Those who have just started their cloud journey mostly use the native tools, while more mature customers have tried both. The feedback we hear is the same for both types: the optimization recommendations they generate, especially the ones related to rightsizing workloads, cannot be trusted.
Why Cost-First Optimization is Doomed to Fail
The sad reality is that we humans cannot optimize cloud resources at scale. No matter how many people you assign to this endeavor, the results will not match those of intelligent software. And despite the focus by both cloud providers and ISVs, the complexity (and the resulting cloud cost overruns) is a growing problem for many organizations. Why? Because their tools still require people to make resourcing decisions. As mentioned above, their recommendations can’t be trusted enough to be automated. Why? Because their offerings do not take an application-first, consumption-based approach: they have no understanding of what resources an application truly needs to perform and therefore cannot determine exactly what is required. So people, your people, are relegated to analyzing vast amounts of data to make best-guess decisions, and they rarely guess right.
You’re not alone. Our Turbonomic State of Multicloud Report has consistently found that the complexity of managing hybrid and multicloud environments is a top challenge to organizations achieving their goals. It is a clear “winner” for leaders and those on par with the majority (i.e., those further along on their journey). Download the full 2021 report here.
Likewise, the top initiative for leaders was optimizing existing cloud resources for performance and cost, closely followed by advancing a multicloud strategy.
So, optimizing cloud investment is a top priority and top challenge. Why is it so hard? Let's start with the common tactics used as part of a manual approach to managing cloud application resources:
- Scaling or Rightsizing both compute and storage allocated to the applications
- Leveraging Reservations and Cloud Cost Models (i.e., Reserved Instances (RIs), AWS Savings Plans, etc.)
- Suspending application resources when not needed
- Deleting application resources when not needed
On the surface, each method is straightforward. But once you dive deeper into each one to understand what is needed to get the most out of it, and factor in that everything must be done continuously, you will understand why it goes beyond human scale.
1 – Effective Application Scaling Requires Complex Analysis
To truly optimize a cloud application, the most important thing is to ensure that the application gets exactly the resources it needs when it needs them. In other words, scaling a workload to the right instance type/VM size requires that all of the workload's compute (memory, CPU), storage (IOPS, throughput), and network throughput metrics are accurately considered, including their peaks and averages.
Accounting for historical peaks and applying industry-standard practices, such as sizing to different percentiles of those peaks, is an opportunity to drive greater elasticity where you can. For example, consider sizing more aggressively to the 90th percentile of peaks for non-production/dev workloads while applying a more conservative 95th percentile, or the absolute peak (i.e., the 100th percentile), for production workloads. More on percentiles in this article.
You should also think about the sampling period (i.e., the observation period) to use when making the rightsizing decision. The right period depends on the workload and on the performance cycles you want to include, based on your unique business context.
On top of that, one must consider the resource limits that are enforced by the compute instance type. For example, on Azure, each VM type has an IOPS limit that applies across all attached volumes. To learn more about this, please check out this article.
Bottom line: the cloud makes it possible to unlock elasticity, ensuring applications get the resources they need when they need them, but achieving it is impossible with manual, allocation-based approaches.
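To make the percentile idea concrete, here is a minimal sketch in Python. It assumes you already have a list of peak utilization samples for the chosen observation period. The policy table, the function names, and the sample data are all hypothetical, and the nearest-rank percentile is just one common definition, not necessarily the one any particular tool uses.

```python
import math

# Hypothetical policy: which percentile of observed peaks to size to.
PEAK_PERCENTILE = {"dev": 90, "prod": 95, "prod-critical": 100}


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples in the observation period")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sizing_target(samples, environment):
    """Utilization level the workload should be sized to handle."""
    return percentile(samples, PEAK_PERCENTILE[environment])


# Hypothetical daily CPU peaks (%) over a 10-day observation period,
# including one outlier spike.
cpu_peaks = [20, 22, 25, 21, 30, 28, 24, 26, 23, 90]

dev_target = sizing_target(cpu_peaks, "dev")    # ignores the single spike
prod_target = sizing_target(cpu_peaks, "prod")  # sized to cover the spike
```

With this sample data, the dev policy sizes to 30% while the prod policy sizes to 90%, which shows how much a single spike inside the observation period can change the decision.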
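The effect of an instance-level limit can be sketched in a few lines. This is an illustration only: the function names are made up, and the numbers are hypothetical rather than the published limits of any specific VM size.

```python
def effective_iops(volume_iops, instance_iops_limit):
    """IOPS the workload can actually reach: the attached volumes'
    combined IOPS, capped by the limit of the instance type."""
    return min(sum(volume_iops), instance_iops_limit)


def wasted_iops(volume_iops, instance_iops_limit):
    """Provisioned volume IOPS that the instance type can never use."""
    return max(0, sum(volume_iops) - instance_iops_limit)


# Hypothetical: two 5,000 IOPS disks attached to a VM type capped at 6,400.
disks = [5000, 5000]
reachable = effective_iops(disks, 6400)
unused = wasted_iops(disks, 6400)
```

In this example the disks are provisioned for 10,000 IOPS, but the workload can only reach 6,400, so a rightsizing decision that looks at the volumes alone would miss both the bottleneck and the 3,600 IOPS being paid for without benefit.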
2 – Scaling between Family Types is not easy, so most tools avoid it
In the cloud, you can choose which instance family your applications will run in. However, things get complex when you try to move applications between families. For example, when scaling between families on AWS, there are multiple factors and limitations to take into consideration. These include ENA (network) and NVMe drivers, restrictions imposed by the virtualization type of the instance's AMI (HVM or PV), whether the compute supports the storage tier (EBS-optimized, for example), and quota limits on the cloud accounts.
It is also essential to understand the exact benefits each instance type offers (old generation vs. new generation) for both capacity and cost. For example, in the image below, you will notice m5.large offers more memory than m4.large, but it costs slightly less. This is very common with AWS, which eventually helps to move customers off older generation hardware over time. But it is not always the case.
Lastly, the most accurate and safest way to scale a workload, especially a mission-critical one, requires insight into application demand: collecting and analyzing essential application metrics such as heap, threads, and response times when determining what capacity to scale to. If you scale a Java-based workload based on vCPU or vMem metrics only, you will not only put application performance at risk; the application owners will also refuse to take the suggested actions, never mind automating them on a continuous basis.
3 – You need a village to Manage Reserved Instances (RIs) at Scale
Reserved Instances (RIs) and AWS Savings Plans are among the best ways to reduce public cloud costs. RIs are long-term commitments, with 1-year or 3-year terms, for a certain amount of capacity. These pre-paid commitments allow users to enjoy significant discounts compared to On-Demand prices (up to 75% on AWS and 72% on Azure). But as we’ve discussed, a cost-first, allocation-based approach loses sight of the most fundamental requirement for cloud elasticity: applications must get the resources they need when they need them. It’s very, very difficult to accurately determine how much your cloud applications need today, let alone what they will need in the next 1 to 3 years. So there’s more analysis and best-guessing, because if you don’t scale your workloads to instance types that match your existing RIs, you’re paying twice: pre-paying for the RIs and then still paying the On-Demand price.
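The "paying twice" effect is simple arithmetic, sketched below. The function name and the monthly prices are hypothetical and chosen only to illustrate the point; real RI and On-Demand prices vary by instance type, region, term, and payment option.

```python
def monthly_cost(ri_monthly, on_demand_monthly, workload_matches_ri):
    """Monthly compute cost for one workload when one RI has been purchased.

    The RI commitment is paid whether or not anything runs on it. If the
    workload sits on an instance type the RI does not cover, the
    On-Demand charge is added on top.
    """
    if workload_matches_ri:
        return ri_monthly
    return ri_monthly + on_demand_monthly


# Hypothetical prices in whole dollars per month.
RI_MONTHLY = 44
ON_DEMAND_MONTHLY = 70

matched = monthly_cost(RI_MONTHLY, ON_DEMAND_MONTHLY, workload_matches_ri=True)
mismatched = monthly_cost(RI_MONTHLY, ON_DEMAND_MONTHLY, workload_matches_ri=False)
```

With these illustrative prices, a matched workload costs 44 per month, while a mismatched one costs 114: more than running On-Demand with no RI at all.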
RI management is even more complex when you have more than one cloud account… Did you know that you can share and use RIs between AWS accounts under the same billing family? Have you considered the complexity of doing so when you have more than 1 or 2 AWS accounts? Or multiple clouds?
Buying new RIs and managing them becomes unmanageable at scale. Tracking RI expirations, following dynamic changes in application workload demand, and understanding the costs of different RI types and terms, all while striving to hit your organization's RI coverage goals, makes these tasks impossible. Many of the customers we talk to had teams of cloud and financial experts assigned to this full-time and still failed to hit their goals.
As illustrated above, managing RIs and rightsizing workloads are both complicated cloud application optimization methods. If you combine the two, and you should if you want to avoid buying RIs for oversized workloads, you will quickly realize that it is almost impossible to do both in parallel, not just for humans but also for other cloud optimization tools.
4 – Don't Forget About Storage Optimization
Another application component that requires constant optimization is the storage the VMs use. Cloud providers offer multiple tiers of storage, each with unique capabilities. For example, on AWS, EBS volumes are provided in six tiers (io1, io2, gp2, gp3, st1, and sc1), each offering a different level of IOPS, throughput, size, burst model, and cost.
On AWS and Azure, depending on the volume type, applications can get more IOPS and/or throughput capacity simply by increasing the size of the volume. Keep in mind that this is an irreversible action: you cannot shrink the volume afterward. But it can be more cost-effective than moving to the next volume type.
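As an illustration of size-driven performance, the sketch below models the baseline IOPS of an AWS gp2 volume. It assumes the figures AWS has documented for gp2 (3 IOPS per GiB, a floor of 100 IOPS, and a ceiling of 16,000 IOPS); check the current EBS documentation before relying on them. Burst behavior is deliberately ignored, and the function names are made up for this example.

```python
import math

# gp2 baseline figures as documented by AWS (assumed here; verify them).
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS of a gp2 volume of the given size (burst ignored)."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_size_for_iops(target_iops):
    """Smallest gp2 size (GiB) whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target exceeds the gp2 ceiling; consider another tier")
    return max(1, math.ceil(target_iops / GP2_IOPS_PER_GIB))


current = gp2_baseline_iops(500)       # baseline of a 500 GiB volume
needed_gib = gp2_size_for_iops(6000)   # size needed for a 6,000 IOPS baseline
```

Here a 500 GiB volume has a 1,500 IOPS baseline, and reaching 6,000 IOPS means growing it to 2,000 GiB, a one-way change. Whether that is cheaper than switching tiers depends on current pricing, which this sketch does not model.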
Furthermore, with the latest volume offerings, such as Amazon EBS gp3 and Azure Ultra disks, users can modify assigned IOPS and Throughput independently from the size. There are many permutations, each with different application performance and cost impact, so which is the right one?
The exciting aspect of EBS tier modifications is that, in general, they can be done without downtime to the instance – but there are still multiple limitations that must be considered when switching between tiers.
5 – But wait… there’s PaaS
There has been a lot of focus among cloud users on Infrastructure-as-a-Service (IaaS) optimization because that’s what most people are using. Most organizations are “stuck” at optimizing IaaS after lifting and shifting applications to the cloud because it’s harder and takes longer than they anticipated. It’s become an obstacle to achieving true transformation via modern cloud native applications.
However, as organizations deploy new applications on the cloud, they will leverage Platform-as-a-Service (PaaS) offerings, such as database services like Amazon RDS or Azure SQL, or managed container services such as EKS (Amazon), AKS (Azure), or GKE (Google Cloud). Some of these services are more expensive because the provider manages more elements of the application stack. To optimize PaaS, you need to collect service-specific metrics (for example, DTU for Azure SQL instances), then analyze the data and take actions. You need to understand the nuances of each PaaS service's metrics and cost structure in order to optimize it. Effective optimization sometimes means doing nothing (no, this is not a typo), and sometimes it means a series of actions. For example, a human may see an application workload metric exceed a threshold and try to scale up, but that's not always the right thing to do if the instance uses burstable compute or storage. A series of actions might mean moving an instance covered by an RI to a cost-effective On-Demand instance type just to free the RI for an instance that will benefit more from it.
If It’s Beyond Human Scale – How Can You Optimize Cloud Applications?
No human (or team of humans) can optimize their public cloud applications effectively at scale because it is too complex, time-consuming, and must be done continuously.
Now, imagine a platform that can do all of the above continuously. It does not sleep, take breaks, go on PTO, or, even worse, quit its job.
Imagine a platform that is a Subject Matter Expert in applications, compute, storage, containers, databases, finance/costs, and Reserved Instances – and can continuously manage the trade-offs between performance, policy compliance, and cost with fully automatable trustworthy actions.
You can stop imagining now - this platform is here!
Cloud specialists are expensive and super busy. It is more logical to assign your brightest cloud specialists to business innovation and revenue-generating activities, and to let our proven, enterprise-class platform, Turbonomic, do the optimization instead. Turbonomic will allow you to improve application performance, increase Cloud Ops staff productivity, and help drive up the VMs-to-Admin ratio.
If you want to learn more about Turbonomic’s capabilities, I have included a few useful links: