In my role, I have the privilege of meeting multiple enterprise customers and prospects every day and discussing their challenges and pains related to the management and control (or lack thereof) of their hybrid environments.
The number one challenge I hear about is cloud cost optimization. Regardless of whether they are new to Public Cloud or “Cloud Veterans”, almost every organization admits that its cloud bills are higher than projected and that its attempts to optimize them did not achieve the desired results.
However, after we discuss our platform capabilities, some will claim that they “can do what Turbonomic does”. When I ask why they haven’t done so until now, the typical answer is “We just need more time/people/spreadsheets/coffee”.
The sad reality is that we (humans) cannot optimize cloud costs at scale. No matter how many people you assign to this endeavour, the results will not match those of a powerful AI-based platform.
Yes, humans can optimize costs at a small scale, for example when only a handful or a dozen workloads are running in the cloud, but as the cloud footprint grows they quickly lose control.
Don’t let anyone fool you: optimizing cloud costs is hard and complex.
In this blog post, I will share my top reasons why Cloud Cost Optimization is beyond human scale.
Before I begin, let’s start with the common methods used to optimize cloud costs:
- Rightsizing (both compute and storage)
- Long-term commitment (e.g. Reserved Instances (RIs))
- Stopping/Deleting (Idle or unneeded resources)
On the surface, each method is simple. But once you dive deeper into each one, understand what is really needed to get the most out of it, and remember that everything must be done continuously, you will perhaps understand the reasoning behind my bold statement above.
1 – Effective Rightsizing Requires Complex Analysis
To truly optimize a cloud workload, the most important thing is to ensure you are Rightsizing the workload compute to the most cost-effective size while avoiding performance risks.
When scaling a workload to a new instance type/VM size, you need to ensure that all of the workload’s compute (memory, CPU), storage (IOPS) and network throughput metrics are accurately considered, including their peaks and averages.
When rightsizing a workload, you must consider its historical peaks and apply industry-standard practices, such as sizing to different peak percentiles where it makes sense. For example, consider sizing more aggressively to the 90th percentile of peaks for non-prod/dev workloads, while sizing more conservatively to the 95th percentile or to the absolute peak (i.e. the 100th percentile) for production workloads.
You also need to decide which period of time (the observation or sampling period) to consider when making the rightsizing decision. It depends on the workload and on which of its performance cycles you want to include. For example, you may want to consider the last 7 days for workloads that are more elastic, and a longer period such as the last 90 days for workloads that are less elastic.
The last two approaches will essentially allow you to unlock the elasticity the cloud offers – isn’t elasticity one of the reasons you moved to the Cloud? Surely you want to enjoy it as much as possible, otherwise you are not using the cloud correctly, and you will pay for it.
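The percentile and observation-period choices described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hourly vCPU samples, the `SIZES` catalog and the function names are hypothetical, the nearest-rank percentile is one of several common definitions, and the sketch looks at a single metric, whereas a real decision must also weigh memory, IOPS and network throughput.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def sizing_demand(hourly_samples, pct, observation_days):
    """Demand to size for: the pct-th percentile over the most recent days."""
    window = hourly_samples[-observation_days * 24:]
    return percentile(window, pct)

def smallest_fit(demand, sizes):
    """Smallest size whose capacity covers the demand, or None if none does."""
    fitting = [(capacity, name) for name, capacity in sizes.items() if capacity >= demand]
    return min(fitting)[1] if fitting else None

# Hypothetical size catalog: name -> vCPU capacity (not a real price list).
SIZES = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}

# Hypothetical hourly vCPU demand for 90 days, oldest first.
normal_day = [0.8] * 22 + [1.6, 3.2]
spike_day = [0.8] * 17 + [6.5] * 5 + [1.6, 3.2]  # one-off spike 31 days ago
hourly_vcpu = normal_day * 59 + spike_day + normal_day * 30

# Aggressive non-prod policy: 90th percentile over the last 7 days.
dev_size = smallest_fit(sizing_demand(hourly_vcpu, 90, 7), SIZES)
# Conservative production policy: absolute peak over the last 90 days.
prod_size = smallest_fit(sizing_demand(hourly_vcpu, 100, 90), SIZES)
print(dev_size, prod_size)
```

With the same history, the two policies land on very different sizes, which is why the percentile and the observation period have to be chosen per workload rather than once for the whole estate.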
2 – Scaling between different Family Types is not easy; most tools avoid it
This gets even harder when trying to move between different instance type families on AWS. When scaling between families, multiple factors and limitations must be taken into consideration, such as ENA (network) and NVMe drivers, limitations imposed by the virtualization type of the instance’s AMI (HVM or PV), whether the compute supports the storage tier (EBS-optimized, for example), quota limits on the cloud accounts, etc.
By the way, moving between families is so complex that most cloud optimization tools don’t attempt it and prefer to scale only within the same family (Spoiler Alert: Turbonomic is not one of them). The downside of scaling within the same family is that each step up typically doubles both the capacity and the cost.
It is also important to understand the exact benefits each instance type offers (old generation vs new generation) for both capacity and cost. For example, in the image below, you will notice m5.large offers more memory than m4.large but it costs a bit less. This is very common with AWS, as this eventually helps to move customers off older generation hardware over time. But it is not always the case…
You may have noticed in the above image that m4.large and m5.large provide exactly the same number of vCPUs (2), but does that mean the vCPU speed is the same? The answer is no. Even if you make a scaling decision based on a single metric like vCPU (you should not!), how do you know which instance type will give the CPU exactly what it needs?
To determine the CPU speed of an instance type on public clouds, you must use Amazon’s EC2 Compute Unit (ECU) or the Azure Compute Unit (ACU). Have you ever tried to determine the actual vCPU speed of a cloud instance type using ACU or ECU? Hint: it is not very human-friendly.
Lastly, the most accurate way to rightsize a workload, especially a mission-critical one, requires insight into the application demand: collecting and analyzing important application metrics such as heap, threads and response times when determining what capacity to scale to. If you rightsize a Java-based workload based on vCPU or vMem metrics only, you might get into a world of pain.
3 – You need a village to Manage RI at Scale
RIs are one of the best ways to reduce costs on public clouds. At a high level, RIs allow users to enjoy significant discounts compared to On-Demand prices (up to 75% on AWS and 72% on Azure) by making a long-term, pre-paid commitment of 1 or 3 years for a certain amount of capacity.
To truly enjoy the savings RIs offer, you must keep the utilization of your RI pool as high as possible by scaling workloads to instance types that match your existing RIs. Otherwise, you are losing money twice: first, when an RI you already paid for sits unused, which means you are leaving money on the table; and second, when a workload runs on the On-Demand/PAYG offering while there is an RI it could use.
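The two ways of losing money described above map onto two numbers: RI utilization (how much of what you bought is in use) and RI coverage (how much of what you run is covered). Below is a minimal sketch with hypothetical inventory counts and a hypothetical `ri_report` helper. It deliberately assumes that an RI only matches an instance of exactly the same type, and it ignores real-world matching rules such as scope, platform and size flexibility.

```python
from collections import Counter

def ri_report(reserved, running):
    """Compare owned RIs with running instances, matching on exact instance type.

    reserved: Counter of instance type -> number of RIs owned
    running:  Counter of instance type -> number of running instances
    """
    types = set(reserved) | set(running)
    covered = {t: min(reserved[t], running[t]) for t in types}
    total_covered = sum(covered.values())
    total_reserved = sum(reserved.values())
    total_running = sum(running.values())
    return {
        # Share of purchased RIs that are actually being used.
        "utilization": total_covered / total_reserved if total_reserved else 0.0,
        # Share of running instances that are covered by an RI.
        "coverage": total_covered / total_running if total_running else 0.0,
        # RIs paid for but idle (money left on the table).
        "unused_ris": {t: reserved[t] - covered[t] for t in types if reserved[t] > covered[t]},
        # Instances billed On-Demand although another type has spare RIs.
        "on_demand": {t: running[t] - covered[t] for t in types if running[t] > covered[t]},
    }

# Hypothetical inventory for illustration only.
reserved = Counter({"m5.large": 10, "c5.xlarge": 4})
running = Counter({"m5.large": 6, "c5.xlarge": 4, "m4.large": 5})

report = ri_report(reserved, running)
print(f"utilization={report['utilization']:.0%} coverage={report['coverage']:.0%}")
print("unused RIs:", report["unused_ris"])
print("on-demand:", report["on_demand"])
```

In this made-up inventory, four m5.large RIs sit idle while five m4.large instances run On-Demand, so moving four of those workloads to m5.large (if their demand allows it) would raise both numbers at once.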
RI management is even more complex when you have more than one cloud account. Did you know that you can share and use RIs between AWS accounts under the same billing family? Have you considered the complexity of doing so when you have more than 1 or 2 AWS accounts?
Buying new RIs and managing them becomes unmanageable at scale. Tracking RI expirations, following dynamic changes in workload demand and understanding the costs of the different RI types and terms, all while striving to hit your organization’s RI coverage goals, makes these tasks impossible. Many of the customers we talk to had teams of cloud and financial experts assigned to this full-time and still failed to hit their goals.
As illustrated above, managing RI and rightsizing workloads are both complicated methods. If you combine the two, you will quickly realize that it is almost impossible to do both in parallel (not just for humans, even for other cloud optimization tools).
4 – Don’t Forget About Storage Optimization
Another aspect that requires constant optimization is the storage the VMs use. Cloud vendors offer multiple tiers of storage, each with its own capabilities. For example, on AWS, EBS volumes are offered in four tiers (io1, gp2, st1 and sc1), each offering a different level of IOPS, throughput, size range, burst model and cost. To fully leverage some of the benefits, the instance must use an instance type that is EBS-optimized.
The interesting aspect of EBS tier modifications is that, in general, they can be done without downtime to the instance. But there are still multiple limitations that must be considered when switching between tiers, and they differ depending on whether the volume is a root or a data volume. There is also a limit that requires you to wait 6 hours between modifications.
It is also possible to modify the IOPS capacity of an EBS volume instead of moving it to a new tier, but this requires increasing the volume size.
5 – Managing Cloud Waste at Scale is Challenging
Public Cloud resources are “utilities”: you pay for what you use, just like electricity. For example, when you leave your house for a vacation, you want to ensure you don’t leave the lights or the AC/heating on while no one is home.
In the cloud, there are two areas you should focus on to reduce waste:
- Identifying and suspending idle workloads or suspending workloads after hours (usually non-prod) – for example, suspending a workload from 6PM to 6AM every day cuts its running hours in half, yielding roughly 50% cost savings
- Identifying and deleting unattached storage – this can add tens of thousands of dollars a month to the cloud bill depending on the volumes’ tier, size and the total amount of unattached volumes
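The 50% figure in the first bullet is simple arithmetic on suspended hours, sketched below. The function name and the `full_days_off` option (for example, weekends) are assumptions added for illustration. The result is the share of hours suspended, which approximates the compute savings only; attached storage is typically still billed while an instance is stopped.

```python
HOURS_PER_WEEK = 24 * 7

def suspended_fraction(stop_hour, start_hour, full_days_off=0):
    """Fraction of the week a workload is suspended.

    stop_hour and start_hour use a 24-hour clock: the workload is stopped at
    stop_hour and resumed at start_hour on every day it runs.
    full_days_off is the number of days per week it stays off entirely.
    """
    nightly_hours = (start_hour - stop_hour) % 24
    running_days = 7 - full_days_off
    hours_off = nightly_hours * running_days + 24 * full_days_off
    return hours_off / HOURS_PER_WEEK

# 6PM to 6AM every day, as in the example above.
every_night = suspended_fraction(stop_hour=18, start_hour=6)
# Same schedule, but also off for two full days per week (hypothetical).
nights_and_weekends = suspended_fraction(stop_hour=18, start_hour=6, full_days_off=2)

print(f"6PM-6AM daily: {every_night:.0%} of hours suspended")
print(f"6PM-6AM plus two full days off: {nights_and_weekends:.0%} of hours suspended")
```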
Many organizations opt to use simple scripts to suspend and resume VMs on a schedule, but as the environment changes, maintaining and updating these scripts becomes a time-consuming effort.
The unattached storage problem is quite common: many users destroy instances and forget to delete the associated volumes, and companies are billed for these volumes regardless of whether they are used or not.
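To show how unattached volumes add up, here is a small sketch that filters a volume inventory and estimates the monthly cost of the orphans. Everything in it is hypothetical: the inventory records, the field names and especially the per-GB prices are placeholders, not actual AWS rates, and in practice the inventory would come from the cloud provider's API or billing export.

```python
# Placeholder prices in $ per GB-month, for illustration only (not real rates).
PRICE_PER_GB_MONTH = {"gp2": 0.10, "io1": 0.125, "st1": 0.045, "sc1": 0.025}

def unattached_volume_cost(volumes, price_per_gb_month):
    """Return (unattached volumes, their estimated monthly cost in dollars)."""
    orphans = [v for v in volumes if not v["attached_to"]]
    monthly_cost = sum(v["size_gb"] * price_per_gb_month[v["tier"]] for v in orphans)
    return orphans, monthly_cost

# Hypothetical inventory: attached_to is None when no instance uses the volume.
volumes = [
    {"id": "vol-a", "tier": "gp2", "size_gb": 500, "attached_to": "i-1"},
    {"id": "vol-b", "tier": "gp2", "size_gb": 1000, "attached_to": None},
    {"id": "vol-c", "tier": "st1", "size_gb": 2000, "attached_to": None},
]

orphans, monthly_cost = unattached_volume_cost(volumes, PRICE_PER_GB_MONTH)
print([v["id"] for v in orphans])
print(f"estimated waste: ${monthly_cost:.2f} per month")
```

The filter itself is trivial. The hard part at scale is keeping the inventory current across accounts and deciding which unattached volumes are truly safe to delete.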
6 – Wait, there is more…
Optimize more than just basic VMs (IaaS) – you most likely run more than just VMs on the cloud. If you are leveraging DB PaaS such as RDS or SQL on Azure, or managed container services such as EKS or AKS, you need to be able to collect metrics, analyze the data and take actions, and you need to understand the nuances of each PaaS service to truly optimize it.
Effective optimization sometimes means doing nothing, and sometimes it means a series of actions. For example, a human may see a workload metric exceed a threshold and try to scale the workload up, but that is not always the right thing to do if the instance uses burstable compute or storage. An example of a series of actions would be moving an instance that uses an RI to a cost-effective On-Demand (OD) instance type just to free the RI, so it can be used by an instance that will benefit more from it. I call it tactical forward-thinking optimization.
No human can optimize cloud workloads effectively at scale because it is too complex, time-consuming and must be done continuously if you really want to optimize your cloud costs going forward.
Cloud specialists are expensive, and hiring them to do optimization is a waste of time and money. It is more logical to assign cloud specialists to business innovation and revenue-generating activities, and let an enterprise-class, proven platform, Turbonomic, do the optimization instead. It is worth mentioning that Turbonomic will allow you to reduce the number of cloud specialists needed to operate your cloud, improve Cloud Ops staff productivity, and help drive up the VMs-to-Admin ratio.
Now, imagine a platform that can do all of the above continuously: it does not sleep, take breaks or go on PTO or, even worse, quit its job.
Imagine a platform that is an SME in applications, compute, storage, containers, databases, finance/costs, RI, and can manage the trade-offs between performance, policy compliance and cost with trustworthy actions…You can stop imagining now - this platform is here ☺
Here are a few proof points from our customer base:
A Fortune 500 Oil & Gas company used Turbonomic to optimize ~1500 instances on AWS and Azure – the result was an 18% reduction in their cloud bill, approx. $35k per month.
One of the world’s Big 4 professional organizations used Turbonomic to optimize approx. 7000 workloads on Azure – by executing our platform’s trustworthy actions they were able to save $394,000 / month on their Azure bill (from compute and storage optimizations), and they are well on their way to saving a total of $10,000,000 with Turbonomic.