Cloud application optimization is beyond human scale. Allow me to explain.
In my role, I have the privilege of meeting and talking to multiple enterprise customers and prospects every day, discussing their challenges and pains related to the management and control (or lack thereof) of their cloud applications’ performance and cost.
The number one challenge I hear is around cloud cost optimization. Regardless of whether they are new to public cloud or so-called "Cloud Veterans," almost every organization admits that their cloud bills are beyond what they projected and that their attempts to optimize their cloud-based applications did not achieve the desired results.
In some cases, after we discussed our platform capabilities, some claimed that they "can do what Turbonomic does." When asked why they haven't done so until now, the typical answers are, "We just need more time/people/spreadsheets/coffee." Over and over again, I have seen that this is not the case.
Common Cloud Application Initiatives
Based on my discussions with cloud users, the below approaches are the top three cloud application initiatives used across the board:
Manual Optimization Initiatives
This is the most common approach. These initiatives run on a cadence that varies between companies, usually kicked off when a CxO has had enough of the out-of-control cloud bills. They typically involve a joint effort between IT Ops, cloud application owners, and finance, or FinOps in more mature organizations. These initiatives require the most brilliant people in the organization to work together, reviewing an extraordinary amount of data to decide what to cut and how. These efforts sometimes yield results, mainly in the form of cleaning up forgotten resources (the low-hanging fruit of cost optimization). However, they are time-consuming even for small environments, and unless done continuously, their impact is “gone like the wind.”
Cost Visibility and Reporting
This includes deep billing analysis, chargebacks, showbacks, and other forms of reporting and drill-downs. This approach is excellent at showing how big the problem is and, in some cases, helps identify billing anomalies. Where it fails is that it rarely reduces cloud application costs or addresses the behaviors that led to the higher costs in the first place (more on that later).
Cost Optimization Tools
This category includes two types of tools. The first is native tools offered by the cloud vendors, such as Azure Advisor from Microsoft and a plethora of tools from AWS, including AWS Compute Optimizer, AWS Trusted Advisor, and others. The second is third-party tools from ISVs that "specialize" in cloud cost reduction. Those who have just started their cloud journey mostly use the native tools, while more mature customers have tried both. The feedback we hear is the same for both types: the optimization recommendations generated, especially those related to rightsizing workloads, could not be trusted.
Why These Cloud Application Optimization Approaches are Doomed to Fail
The sad reality is that we humans cannot optimize cloud-based applications, or their associated costs, at scale. No matter how many people you assign to this endeavor, the results will not match those of an intelligent software platform. And as mentioned above, many software platforms also fail to generate actual results.
Yes, humans can optimize apps and control costs on a small scale, for example, when there are a handful or maybe a dozen applications running in the cloud. But as the cloud footprint grows, they quickly lose control.
Don't let anyone fool you: optimizing applications in the cloud is challenging and complex. It is a growing problem for many organizations, and despite the focus from both cloud providers and ISVs, the challenge grows every year, as the Flexera State of the Cloud reports between 2016 and 2020 illustrate.
Now, we’ll get into why cloud app cost optimization is beyond human scale.
Before diving in, let's review the common tactics used in a manual approach to cost optimization:
- Scaling or Rightsizing both compute and storage allocated to the applications
- Leveraging Reservations and Cloud Cost Models (i.e., Reserved Instances (RIs), AWS Savings Plans, etc.)
- Suspending application resources when not needed
- Deleting application resources when not needed
On the surface, each method is straightforward. But once you dive deeper into each one, understand what is needed to get the most out of it, and account for the fact that everything must be done continuously, perhaps you will understand why it goes beyond human scale.
1 – Effective Application Scaling Requires Complex Analysis
To truly optimize a cloud application, the most important thing is to ensure you are scaling its workloads’ compute to the most cost-effective instance type while avoiding performance risks.
When scaling a workload to a new instance type/VM size, you need to ensure that all the workloads' compute (Mem, CPU), storage (IOPS, throughput), and network throughput metrics are accurately considered—including their peaks and averages.
When rightsizing a workload, you must consider the workload’s historical peaks and apply industry standards such as using different peak percentiles where it makes sense. For example, consider more aggressive sizing to the 90th percentile of peaks for non-prod/dev workloads while applying a more conservative 95th percentile, or sizing to the absolute peak (i.e., 100%), for production workloads. More on percentiles in this article.
You also need to think about the sampling period to consider for scaling (i.e., the observation period) when making the rightsizing decision. The right choice depends on the workload and the performance cycles you want to capture.
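To make the percentile idea concrete, here is a minimal sketch of percentile-based sizing. The helper name, the sample data, and the 90th/100th percentile policy split are illustrative assumptions, not a description of any particular tool's algorithm:

```python
import math

# Hypothetical helper: pick the CPU demand figure to size against,
# using a percentile that depends on the workload's criticality.
def sizing_target(cpu_samples, environment):
    """Return the CPU utilization value (percent) to size against.

    cpu_samples: utilization samples collected over the observation period.
    environment: "prod" sizes to the absolute peak (100th percentile);
    anything else uses the more aggressive 90th percentile of peaks.
    """
    ordered = sorted(cpu_samples)
    if environment == "prod":
        return ordered[-1]  # absolute peak: the most conservative choice
    # Nearest-rank 90th percentile for non-prod/dev workloads
    rank = math.ceil(0.90 * len(ordered)) - 1
    return ordered[rank]

samples = [12, 15, 18, 20, 22, 25, 30, 35, 55, 95]  # one noisy spike
print(sizing_target(samples, "dev"))   # 55 -> the one-off spike is ignored
print(sizing_target(samples, "prod"))  # 95 -> prod sizes to the peak
```

Note how the same metric history yields two different sizing targets: the dev workload can be packed onto a smaller instance because a single spike is discounted, while production keeps headroom for its worst observed demand.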
These last two techniques essentially allow you to unlock the elasticity the cloud offers. And isn't elasticity one of the reasons you moved to the cloud? Surely you will want to enjoy it as much as possible; otherwise, you are not using the cloud correctly, and you will pay for it.
On top of that, one must consider the resource limits enforced by the compute instance type. For example, on Azure, VM types include an IOPS limit that applies to all attached volumes. To learn more about this, please check out this article.
2 – Scaling Between Different Instance Families Is Not Easy, So Most Tools Avoid It
In the cloud, you can choose which instance family your applications will run in. However, things get complex when trying to move applications between different families. For example, when scaling between families on AWS, multiple factors and limitations must be taken into consideration. These include ENA (network) and NVMe drivers, restrictions imposed by the instance's AMI type (HVM or PV), making sure the compute supports the storage tier (EBS-optimized, for example), as well as quota limits on the cloud accounts.
It is also essential to understand the exact benefits each instance type offers (old generation vs. new generation) for both capacity and cost. For example, m5.large offers more memory than m4.large, yet it costs slightly less. This is very common with AWS, which over time helps move customers off older-generation hardware. But it is not always the case.
Lastly, scaling a workload in the most accurate and safest way, especially a mission-critical one, requires insight into the cloud application's demand: collecting and analyzing essential application metrics such as heap, threads, and response times when determining what capacity to scale to. If you scale a Java-based workload based on vCPU or vMem metrics alone, you not only risk the application's performance; the application owners will also refuse to take the suggested actions due to lack of trust.
3 – You need a village to Manage Reserved Instances (RIs) at Scale
Reserved Instances (RIs) and AWS Savings Plans are among the best ways to reduce application costs on public clouds. RIs give users significant discounts compared to On-Demand prices (up to 75% on AWS and 72% on Azure) in exchange for a 1-year or 3-year commitment to a certain amount of capacity, pre-paid based on the resources you believe your cloud-based applications will consume during that period.
To truly enjoy the savings RIs offer, you must keep your RI pool utilization as high as possible by scaling workloads to instance types that match your existing RIs. Otherwise, you lose money twice: first, when you have unused RIs you already paid for, leaving money on the table; second, when a workload runs On-Demand while there is an available RI it could use instead.
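The "losing money twice" point is easy to quantify. The sketch below uses illustrative rates, not real AWS prices (an assumed amortized RI rate of $0.06/hr against an assumed $0.10/hr On-Demand rate for the same instance type):

```python
# Illustrative rates, not real AWS prices: an RI hour effectively costs
# $0.06 once amortized, vs. an On-Demand rate of $0.10 for the same type.
RI_EFFECTIVE_RATE = 0.06
ON_DEMAND_RATE = 0.10

def double_loss(unused_ri_hours, mismatched_on_demand_hours):
    """Quantify the two RI losses for one billing period.

    Loss 1: RI hours already paid for that no workload consumed.
    Loss 2: the discount missed by hours that ran On-Demand on a
    non-matching instance type instead of consuming the RI pool.
    """
    unused_ri_cost = round(unused_ri_hours * RI_EFFECTIVE_RATE, 2)
    missed_discount = round(
        mismatched_on_demand_hours * (ON_DEMAND_RATE - RI_EFFECTIVE_RATE), 2
    )
    return unused_ri_cost, missed_discount

# 500 idle RI hours plus 500 On-Demand hours that could have matched:
print(double_loss(500, 500))  # (30.0, 20.0) -> $50 lost in one period
```

The fix for both losses is the same action, scaling workloads onto instance types the RI pool already covers, which is exactly why rightsizing and RI management cannot be optimized independently.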
RI management is even more complex when you have more than one cloud account. Did you know that you can share RIs between AWS accounts under the same billing family? Have you considered the complexity of doing so when you have more than one or two AWS accounts? Or multiple clouds?
Buying new RIs and managing them becomes unmanageable at scale. Factor in RI expirations, dynamic changes in workload demand, and the cost differences between different RI types and terms, all while striving to hit your organization's RI coverage goals, and the task becomes impossible. Many customers we talk to had teams of cloud and financial experts assigned to this full-time and still failed to hit their goals.
As illustrated above, managing RIs and rightsizing workloads are each complicated cloud application optimization methods on their own. Combine the two, as you should if you want to avoid buying RIs for oversized workloads, and you will quickly realize that it is almost impossible to do both in parallel, not just for humans but also for most cloud optimization tools.
4 – Don't Forget About Storage Optimization
Another application component that requires constant optimization is the storage the VMs use. Cloud providers offer multiple tiers of storage, each with unique capabilities. For example, on AWS, EBS volumes are offered in six tiers: io1, io2, gp2, gp3, st1, and sc1, each offering a different level of IOPS, throughput, size, burst model, and cost.
On AWS and Azure, depending on the volume type, applications can get more IOPS and/or throughput capacity simply by increasing the size of the volume. Keep in mind that this is an irreversible action: you cannot go back down in size. Still, it can be more cost-effective than moving to the next volume type.
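The size-to-performance coupling is easiest to see with AWS gp2, where (at the time of writing) baseline performance is 3 IOPS per GiB, floored at 100 IOPS and capped at 16,000. A one-line sketch:

```python
def gp2_baseline_iops(size_gib):
    """Baseline IOPS of an AWS gp2 EBS volume of the given size.

    gp2 provisions 3 IOPS per GiB, with a floor of 100 IOPS for small
    volumes and a hard cap of 16,000 IOPS for large ones.
    """
    return min(16_000, max(100, 3 * size_gib))

print(gp2_baseline_iops(30))    # 100 -> small volumes get the floor
print(gp2_baseline_iops(1000))  # 3000
print(gp2_baseline_iops(6000))  # 16000 -> cap reached around 5,334 GiB
```

This is why "just make the volume bigger" is a real tuning lever on gp2, and also why it stops working past the cap, at which point a tier change (for example, to gp3, where IOPS are provisioned independently of size) becomes the better move.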
Furthermore, with the latest volume offerings, such as Amazon EBS gp3 and Azure Ultra Disks, users can modify the assigned IOPS and throughput independently of the size. There are many permutations, each with a different application performance and cost impact, so which is the right one?
The exciting aspect of EBS tier modifications is that, in general, they can be done without downtime to the instance – but there are still multiple limitations that must be considered when switching between tiers.
5 – Managing Cloud Application Waste at Scale is Challenging
Public cloud resources are essentially utilities: you pay for what you use, just like electricity, natural gas, or water, and they should be treated accordingly. For example, if you leave your house for a vacation, you make sure you don't leave the lights, heating, or a faucet on when no one is home. The same discipline should apply to the applications running in your cloud estate.
In the cloud, there are two areas you should focus on to reduce waste:
- Identifying and suspending idle application workloads, or suspending workloads after hours (usually non-prod). For example, suspending a workload between 6 PM and 6 AM yields 50% cost savings; add weekends, and you approach 65% savings.
- Identifying and deleting unattached storage. Depending on the volumes' tier, size, and total count, unattached volumes can add tens of thousands of dollars a month to the cloud bill.
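The suspension savings above are simple hour counting over a 168-hour week, as this small sketch shows (the function name and schedule shape are illustrative):

```python
HOURS_PER_WEEK = 7 * 24  # 168

def schedule_savings(weekday_on_hours, weekend_on_hours):
    """Fraction of a workload's compute bill saved by a suspension schedule."""
    running = 5 * weekday_on_hours + 2 * weekend_on_hours
    return 1 - running / HOURS_PER_WEEK

# Suspending 6 PM-6 AM every day leaves 12 running hours per day:
print(schedule_savings(12, 12))           # 0.5 -> 50% savings
# Also suspending all weekend leaves 5 x 12 = 60 running hours:
print(round(schedule_savings(12, 0), 3))  # 0.643 -> ~64% savings
```

The arithmetic only holds for resources billed by the hour while running, which is the catch: attached storage and reserved capacity keep accruing cost while the VM is stopped.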
Many organizations opt to use scripts to suspend and resume VMs on a schedule. But as the environment changes, maintaining and updating these scripts becomes time-consuming, and scripts don't allow one-time overrides of the schedule when an application is needed outside normal hours. Furthermore, many organizations delegate suspension responsibilities to the cloud application owners, so an easy-to-use self-service experience is a must.
The unattached storage problem is quite common: many users destroy instances and forget to delete the associated volumes, and companies are billed for those volumes whether they are used or not.
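Detecting this waste is conceptually simple: any volume with no attachments is costing money for nothing. The sketch below runs against a hypothetical inventory export; the record fields and per-GB-month rates are illustrative assumptions, not real cloud API output or pricing:

```python
# Assumed per-GB-month rates for illustration only, not real prices.
GB_MONTH_PRICE = {"gp2": 0.10, "gp3": 0.08, "io1": 0.125}

def unattached_volume_cost(volumes):
    """Sum the monthly cost of volumes that nothing is attached to."""
    total = 0.0
    for vol in volumes:
        if not vol["attachments"]:  # no instance is using this volume
            total += vol["size_gib"] * GB_MONTH_PRICE[vol["tier"]]
    return round(total, 2)

# Hypothetical inventory: two orphaned volumes, one still in use.
inventory = [
    {"id": "vol-1", "tier": "gp2", "size_gib": 500, "attachments": []},
    {"id": "vol-2", "tier": "gp3", "size_gib": 100, "attachments": ["i-abc"]},
    {"id": "vol-3", "tier": "io1", "size_gib": 200, "attachments": []},
]
print(unattached_volume_cost(inventory))  # 500*0.10 + 200*0.125 = 75.0
```

Multiply a per-account figure like this across dozens of accounts and tiers, and the "tens of thousands of dollars a month" estimate above stops looking like an exaggeration.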
6 – Wait… Don't forget about PaaS
There has been a lot of focus among cloud users on Infrastructure-as-a-Service (IaaS) optimization, mainly because it makes up a big chunk of the cloud bill. Another reason is that IaaS is where many organizations are "stuck" after lifting and shifting applications to the cloud, before transforming them into "Cloud-Native Modern Applications," which always takes more time than initially planned.
However, as organizations deploy new applications in the cloud, they leverage Platform-as-a-Service (PaaS) offerings such as database services like Amazon RDS or Azure SQL, or managed container services such as EKS (Amazon), AKS (Azure), or GKE (Google Cloud). These services are often more expensive since the provider manages more elements of the application stack. To optimize PaaS, you need to collect service-specific metrics, for example, DTUs for Azure SQL instances, then analyze the data and take action. You must understand the nuances of each PaaS service's metrics and cost structure in order to optimize it, which includes the ability not only to scale PaaS services but also to suspend non-production ones.
Effective optimization sometimes means doing nothing (yes, this is not a typo), and sometimes it requires a series of actions. For example, humans may see an application workload metric exceed a threshold and try to scale it up, but that is not always the right move if the instance uses burstable compute or storage. A series of actions might mean resizing an instance covered by an RI to a cost-effective On-Demand instance type just to free the RI for an instance that will benefit more from it. I call this forward-thinking tactical optimization.
If It’s Beyond Human Scale – How Can You Optimize Cloud Apps?
No human can optimize their public cloud applications effectively at scale because it is too complex, time-consuming, and must be done continuously.
Now, imagine a platform that can do all of the above continuously – it does not sleep, take breaks, go on PTO, or, even worse, quit its job.
Imagine a platform that is a Subject Matter Expert in applications, compute, storage, containers, databases, finance/costs, and Reserved Instances – and can continuously manage the trade-offs between performance, policy compliance, and cost with fully automatable trustworthy actions.
You can stop imagining now - this platform is here!
Cloud specialists are expensive and super busy. It is more logical to assign your brightest cloud specialists to business innovation and revenue-generating activities, and let our enterprise-class, proven platform, Turbonomic, do the optimization instead. Turbonomic allows you to improve application performance, increase Cloud Ops staff productivity, and drive up the VMs-to-admin ratio.
If you want to learn more about Turbonomic’s capabilities, I have included a few useful links:
- Scaling Compute Resources using Percentiles
- Scaling Compute with full IOPS awareness
- Scaling Block-Storage Volumes
- Scaling Azure SQL Databases (PaaS)
- Scaling and Optimizing Kubernetes Clusters, Pods and Containers (PaaS)
- Leveraging Reservations
- Suspending Workloads to increase elasticity