Welcome to the second article of Turbonomic’s new blog series, ‘Mastering Cloud Cost Optimization.’
In this article, we will share the frameworks employed by organizations that were successful with their digital transformation and cloud optimization journey.
The Cloud Spend Optimization Initiative (and Challenge)
In 2017, RightScale’s (now Flexera) State of the Cloud report listed optimizing existing cloud spend as the top initiative for cloud users (53%) for the first time, replacing the 2016 top initiative ‘Moving more workloads to cloud’:
The cost optimization initiative has remained number one in every report since, and in 2020, 73%(!) listed it as their top initiative:
So why do organizations still rank cloud cost savings as both their top cloud initiative and their top challenge? As you probably already know from your own experience, it isn’t as easy as it sounds.
In my previous article, ‘Mastering Cloud Cost Optimization: The Principles’, I covered the main cloud cost optimization challenges and the core principles needed to achieve a well-architected and continuously optimized cloud environment. Before proceeding, make sure to read it for more details and context.
It is time for action
For years, organizations attempted to achieve cost optimization by focusing on reporting. This included chargeback reports, long Excel spreadsheets, and dashboards with pretty charts and graphs.
Sadly, this approach rarely works. Staring at data will not reduce a cloud bill, nor will sending reports back and forth.
To be clear, this is not to minimize the importance of cost visibility and reporting. Cost visibility is a critical foundation of any organization’s cloud cost optimization strategy and is required to establish accountability, but it is not enough. To optimize a cloud environment, you must execute actions, and again, this is easier said than done.
To execute actions, one needs to understand what actions to take, when to take them, and what the implications of each action will be beyond just cost savings.
I have seen and talked to organizations that created an internal process to identify, analyze, and execute cloud optimization actions. Many admitted the process is time-consuming, cumbersome, manual, and not scalable, especially in large cloud environments where their efforts had limited impact.
The solution? Automation – the ability to execute optimization actions without any human intervention. From a technical perspective, automation is not hard, especially in public clouds, which offer well-documented and robust APIs. The two main challenges with achieving automated cloud optimization are complexity and trust.
The complexity refers to the process of generating an accurate and actionable cloud optimization action, such as rightsizing a VM, a PaaS service, or even a container.
For example, to properly resize a single workload, you first need to observe the utilization of its resources across multiple metrics (e.g., CPU, memory, IOPS, and network), as well as application performance metrics such as response times and transactions. The next step is to analyze all the data and determine the best target instance type/SKU out of a massive, ever-growing catalog of configuration options (and prices) offered by the cloud vendor. Then, once the target configuration has been identified, additional constraints must be considered, such as organization policies, OS driver requirements (e.g., NVMe or ENA on AWS), and storage type support (e.g., EBS optimized/Azure Premium LRS). This image illustrates the multiple dimensions that should be considered when scaling a workload:
I highly recommend reading the blog ‘Cloud Cost Optimization is Beyond Human Scale – Here’s Why’ for more details on the complexities behind cloud cost optimization and how to use Artificial Intelligence (AI) to solve this.
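To make the resize workflow above more concrete, here is a minimal Python sketch of the selection step only. It assumes utilization has already been observed and summarized as peak values, and it uses a tiny made-up catalog. The class names, the `pick_target` function, the 20% headroom, and all instance names and prices are hypothetical and chosen for illustration; they are not taken from any cloud vendor's catalog or from any product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str                  # hypothetical name, not a real SKU
    vcpus: int
    memory_gib: float
    hourly_price: float        # made-up price for illustration
    requires_nvme_driver: bool = False


@dataclass(frozen=True)
class Workload:
    peak_vcpus_used: float     # observed peak CPU demand, in vCPUs
    peak_memory_gib: float     # observed peak memory demand, in GiB
    has_nvme_driver: bool      # OS driver constraint


def pick_target(workload: Workload,
                catalog: List[InstanceType],
                headroom: float = 0.2) -> Optional[InstanceType]:
    """Return the cheapest instance type that fits peak demand plus
    headroom and satisfies the driver constraint, or None if none fits."""
    need_cpu = workload.peak_vcpus_used * (1 + headroom)
    need_mem = workload.peak_memory_gib * (1 + headroom)
    candidates = [
        t for t in catalog
        if t.vcpus >= need_cpu
        and t.memory_gib >= need_mem
        and (workload.has_nvme_driver or not t.requires_nvme_driver)
    ]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


# Hypothetical four-entry catalog; real catalogs have hundreds of options.
CATALOG = [
    InstanceType("small", 2, 4.0, 0.05),
    InstanceType("medium", 2, 8.0, 0.10),
    InstanceType("medium-nvme", 2, 8.0, 0.08, requires_nvme_driver=True),
    InstanceType("large", 4, 16.0, 0.20),
]

workload = Workload(peak_vcpus_used=1.5, peak_memory_gib=5.0,
                    has_nvme_driver=False)
print(pick_target(workload, CATALOG).name)  # medium
```

Note that the cheaper `medium-nvme` option is skipped because the workload lacks the required driver. Even this toy version shows how a single constraint changes the answer; a real decision would also weigh IOPS, network, application performance metrics, organization policies, and existing reservations.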
The journey to automation requires trust
To get organizations to agree to automate actions, you must earn their trust that the actions are accurate, safe, and will not hurt the performance of the applications, especially in production.
Trust requires time and a structured approach; it is a journey with multiple stages, and it is closely aligned with our Public Cloud Maturity Model:
- Start with visibility: The first step is to gain visibility into the entire cloud environment and get a sense of the optimization opportunity at hand. This step includes identifying and aggregating all accounts and subscriptions, understanding the overall spend and the commitments made to the cloud providers, and tagging and labeling the different workloads based on their purpose, owner, and environment (prod, test, dev, etc.). This must be done across all subscriptions/accounts.
- Tackle the “Low Hanging Fruit” first: The first area we recommend, mainly because it is the path of least resistance, is terminating unused resources such as idle or unneeded VMs, load balancers, public IPs, unattached volumes, and old snapshots. A significant amount of savings can be gained in this stage.
- Purchase 1-Year Reservations for Production: While focusing on non-prod, we also recommend that you consider purchasing 1-year reservations for production. The reason is that optimization takes time; there is no way around it. By purchasing 1-year Reserved Instances or Savings Plans on AWS, you will be able to obtain 30-40% savings as you hone your more advanced optimization skills on the non-prod estate. The reason for choosing 1 year rather than 3 years is that the goal is to build a more sophisticated optimization plan for production during the first year, which will include scaling the production workloads to their optimal size (i.e., rightsizing) and then buying new reservations based on the optimized instance type/SKU.
- Implement scheduled suspension: Suspending non-prod workloads after hours can yield instant and substantial savings. For example, suspending workloads between 6 PM and 6 AM every day can reduce compute costs by about 50%, and the savings will be even higher if workloads also stay suspended during weekends and holidays.
- Execute IaaS Scaling in Non-Prod environments: At this stage, the savings are noticeable, and many teams are eager to obtain more. We recommend leveraging the business units’ motivation and tackling the non-prod environments with scale actions. We created a maturity curve that focuses exclusively on that stage since it is critical for the success of the optimization efforts:
- Start with Manual Execution of actions: Review every scale action to validate its accuracy and scrutinize it with a handful of stakeholders from various Business Units (for example, stakeholders from IT/Cloud Ops, Application Team, and finance). Execute the action and validate the impact. Take one step at a time and increase the number of actions executed as the confidence grows.
- Approval workflows: The next step is to implement an approval workflow with your ITSM solution (such as ServiceNow). The optimization scale actions should be routed to the appropriate owner to approve, reject or suggest an adjustment to address elements that were not considered or available when the action was generated. For example, “the suggested instance type is not ideal for this workload since we are planning to double the transactions it will process starting next week.”
- Maintenance/Change windows: As for when to execute the scale actions, start by defining a weekly change window in which all approved scale actions are executed. Over time, expand the scope and frequency of the change windows. Many of our customers use daily change windows to execute scale actions against non-prod workloads, and the most mature ones have moved to full real-time automation, which is the goal.
- Purchase Reservations for Non-Prod: After the majority of the long-term non-prod workloads have been optimized to their ideal compute configuration, you can purchase reservations to obtain additional savings.
- Focus on production: It is time to tackle the production workloads. Take all the lessons learned from non-prod and apply them to production, following the above steps.
- Enable Real-time Automation: As mentioned, some of our mature customers have enabled real-time automation. Some were able to do so faster than others because they had modernized their applications (more on that in the next section).
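The scheduled-suspension figure in the list above is simple arithmetic, and a short Python sketch makes it easy to try other schedules. This is only a rough estimate under two stated assumptions: compute is billed in proportion to running hours, and nothing else (such as attached storage) is counted. The function name and parameters are hypothetical.

```python
def suspension_savings(off_hours_per_day: float,
                       suspend_full_weekends: bool = False) -> float:
    """Estimate the fraction of weekly compute hours avoided by a
    suspension schedule (assumes cost is proportional to running hours)."""
    hours_per_week = 24 * 7
    # Nightly suspension on the five weekdays.
    suspended = off_hours_per_day * 5
    if suspend_full_weekends:
        # Workloads stay off for all of Saturday and Sunday.
        suspended += 24 * 2
    else:
        # Weekends follow the same nightly schedule as weekdays.
        suspended += off_hours_per_day * 2
    return suspended / hours_per_week


# 6 PM to 6 AM every day: 84 of 168 hours suspended.
print(suspension_savings(12))                                  # 0.5
# Same nightly schedule plus full weekends: 108 of 168 hours.
print(round(suspension_savings(12, suspend_full_weekends=True), 2))  # 0.64
```

The second case shows why weekend suspension is worth the extra scheduling effort: the estimate rises from 50% to roughly 64% of compute hours avoided, before holidays are even considered.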
Application Modernization and Cost Efficiency
Since scale actions on the cloud are disruptive, not all workloads can be resized often, and some require a graceful shutdown of specific services as part of the scaling process. When an application is modernized to leverage cloud-native architectures and PaaS services, it unlocks the ability to take optimization actions in real time and leverage automation without any impact on the application.
Therefore, it is critical that organizations, in parallel to their continuous optimization initiatives, invest in Application Modernization and architect their applications for cost efficiency by leveraging PaaS services and cloud-native technologies such as containers and functions (aka serverless).
Check out my blog ‘The Top 2 Challenges of Next-Gen Applications’ to learn more about the typical challenges organizations face when building next-gen applications.
Stay tuned for the next article in this blog series, Mastering Cloud Cost Optimization: Cloud Cost Models & Discounts Overview. Leveraging the correct cloud cost model for workloads is one of the most effective methods to reduce cloud costs. The upcoming blog will provide an overview of the available cost and discount models on the cloud and when to use them.