In my previous post I discussed the problem facing most organizations in various stages of cloud adoption – how to deliver on application SLAs as efficiently as possible, paying only for what they use and using only what they need.
The only way to solve this problem is to continuously calibrate the allocation of resources based on demand – in other words, to match supply and demand. The result is paying only for the resources you need, when you need them: the holy grail of cloud cost efficiency. And the only way to achieve this is through automation.
To be truly elastic, every decision must take multiple factors into account:
- The dynamic changes to the demand of resources
- The impact of the decision on the application’s SLA
- The effect on:
  - Other workloads in the environment
  - Additional components of the same application
- The cost (or savings) associated with making the change
- Alignment with business policies and compliance
How are people trying to resolve the problem today?
Typically today, this is being attempted in three different ways:
- Manual - Resource management using spreadsheets
- Rules - Threshold based single resource rules in real time
- Batch Analytics - Complex batch analytics that run periodically, usually once a week or month
None of these approaches solve the problem.
In this post, I will discuss the Manual approach to solve the problem.
Some organizations attempt to solve this problem manually. Whether the goal is to identify which Reserved Instances to purchase, the correct instance type for each workload, or the best way to migrate applications from their on-premises datacenter, they approach the problem using spreadsheets.
The different teams responsible for each of these distinct tasks create spreadsheets that aggregate data from multiple sources. The teams then pore over and merge the collected data and try to work out what the answer to each of the above questions should be.
The Problem with this Approach
First, this is a time sink and it doesn't scale. Merging large data sets, identifying savings opportunities, and preventing performance issues takes people time, and because demand is constantly changing, issues and opportunities are likely to be missed before the analysis is even finished. Most organizations I speak with view 30-50 VMs as the point at which the manual approach becomes unmanageable.
Second, it uses static data. This means the decisions are relevant only for a specific point in time and don't update as the demand for resources changes. Some organizations incorporate different heuristics, like using a 30-day average, peak consumption, or the 90th percentile of peaks. While all of these heuristics are better than single point-in-time data, they are still based on history alone. If an application suddenly experiences a peak in demand, that peak will not necessarily appear in the history, which means this approach cannot guarantee that applications get the resources they need.
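To see why these heuristics fall short, here is a minimal sketch of the three sizing heuristics mentioned above, applied to hypothetical hourly CPU-utilization samples for a single VM (the sample data and percentage figures are illustrative assumptions, not real measurements):

```python
# Illustrative only: 30 days of hourly CPU utilization (%) for one VM,
# flat at 20% with a single sudden spike in the final hour.
import statistics

samples = [20.0] * 719 + [95.0]

avg_30d = statistics.mean(samples)              # 30-day average
peak = max(samples)                             # peak consumption
p90 = sorted(samples)[int(0.9 * len(samples))]  # ~90th percentile

# Sizing to the average or the 90th percentile misses the spike
# entirely; sizing to the peak over-provisions the other 719 hours.
print(f"avg={avg_30d:.1f}%  p90={p90:.1f}%  peak={peak:.1f}%")
```

Both the average (~20.1%) and the 90th percentile (20%) look comfortably low here, yet the VM briefly needed nearly five times that capacity – exactly the kind of demand change a static, history-based spreadsheet cannot anticipate.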
Lastly, organizations using this approach will in most cases select only a subset of metrics to consider, and decisions made on that subset can negatively impact application performance. When considering CPU and memory but not IOPS, for example, the performance of a disk-heavy application can degrade. The number of metrics that must be fed into the decision process is enormous and extremely challenging for people to handle.
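The CPU-and-memory-but-not-IOPS trap can be sketched in a few lines. This is a hypothetical example with made-up utilization figures and a made-up `can_downsize` rule, not any real sizing tool:

```python
# Hypothetical VM utilization: CPU and memory look idle,
# but the workload is actually disk-bound.
vm = {"cpu_pct": 15.0, "mem_pct": 25.0, "iops_pct": 92.0}

def can_downsize(metrics, considered=("cpu_pct", "mem_pct"), threshold=40.0):
    """Recommend a smaller instance only if every *considered* metric
    is under the threshold. Metrics left out are silently ignored."""
    return all(metrics[m] < threshold for m in considered)

print(can_downsize(vm))  # looks safe when only CPU and memory are checked
print(can_downsize(vm, ("cpu_pct", "mem_pct", "iops_pct")))  # IOPS says no
```

Checking only CPU and memory recommends shrinking the VM; adding IOPS to the considered set reverses the decision. Every metric omitted from the spreadsheet is a chance to degrade an application in exactly this way.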
Even if an organization finds a way around one of the problems above, it usually trades that fix off against the others – using a subset of the metrics, a subset of its virtual estate, or approximations. All variations of this approach are static, periodic, time consuming, and error prone. To effectively utilize cloud resources and be truly elastic, you need a platform that understands demand in real time and can make incremental changes accordingly.
In my next blog I will discuss another approach: the Rules approach. Stay tuned!