It has been more than five years since our application performance control system was deployed in our first customer environment, and from the beginning I was asked to provide a “headroom dashboard/report.” I was able to resist the request for a couple of years but eventually had to acquiesce; a few years ago we reluctantly released our “headroom dashboard.”
For the last five years I have been baffled by three simple questions about the “Headroom Dashboard”:
- What does the headroom mean? Do we really understand the question we are asking?
- What do you do with the data in this dashboard?
- Why do our customers like this dashboard so much and keep asking for it?
These questions seem simple and obvious. But are they?
What Does the Headroom Mean?
When I ask our customers what they mean by headroom, the answer is always the same… “The headroom answers the question of how many VMs can fit in a given cluster.” Simple? Not really. I don’t think this question can be answered without knowing the size of the VMs we are trying to fit. Wouldn’t the answer be different for different-sized VMs?
Let’s try to refine the question… “The headroom answers the question of how many VMs of a given size (template) can fit in a given cluster.” This is better. Now at least we can probably give a number as the answer, but does this answer mean anything? Will all the VMs that are deployed be the same size (similar to the template size)? If not, the answer to this question has very little relevance.
Let’s try again… “The headroom answers the question of how many VMs of an average size can fit in a given cluster.” This is getting interesting.
- Is it the average size of all the VMs in the universe?
- Is it the average size of all the VMs in the enterprise?
- Is it the average size of all the VMs in the data center?
- Is it the average size of all the VMs in the cluster?
- Is it the average size of all the VMs on a given host?
Furthermore, is the average size the average of the VMs’ memory, CPU, storage, or network? Is it the average of the average consumption of these resources, or of their peak consumption? And if the question is about peaks, will all VMs peak at the same time? Obviously, the answer to the headroom question depends on the answers to these questions.
Let’s assume we agree on the answers to the above questions. Can we get a meaningful answer to the headroom question? The answer is NO!
Let me explain using a simple example. Let’s say we have a cluster with 10 hosts, each of size 10. There are 15 VMs in the cluster: 5 each of sizes 5, 3, and 2. The size-5 VMs are placed on 5 separate hosts, and each of the other 5 hosts holds a pair of VMs of sizes 2 and 3. Each host is 50% utilized, and the average VM size is 3.33.
How many VMs of average size 3.33 fit in this cluster? The answer is “I don’t know.” It depends on the mix of VMs we want to fit. If all the VMs are of size 3.33, we can fit 10 VMs in the cluster; but if the VMs are of different sizes, say a mix similar to the one already running on the 10 hosts, we can fit 15 VMs. Is the answer 10 or 15? I don’t know. Furthermore, if a minute later the 15 VMs in the cluster are rearranged, the answer to the headroom question will change.
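The cluster example above can be sketched in a few lines of code. This is a minimal illustration, assuming a simple first-fit placement where a VM lands on the first host with enough free capacity; the function name and placement policy are my own, not anything from a real product:

```python
def first_fit(free, vms):
    """Place each VM on the first host with enough free capacity.

    free -- list of free capacity per host
    vms  -- list of VM sizes to place
    Returns the number of VMs successfully placed.
    """
    free = list(free)            # don't mutate the caller's list
    placed = 0
    for vm in sorted(vms, reverse=True):  # try largest VMs first
        for i, f in enumerate(free):
            if f >= vm:
                free[i] -= vm
                placed += 1
                break
    return placed

# 10 hosts of size 10, each 50% utilized -> 5 units free per host
free_per_host = [5.0] * 10

# Case 1: all incoming VMs are "average" sized (50/15 = 3.33);
# offer more VMs than can possibly fit to find the limit.
print(first_fit(free_per_host, [50 / 15] * 20))          # 10 fit

# Case 2: a mix like the one already running in the cluster.
print(first_fit(free_per_host, [5] * 5 + [3] * 5 + [2] * 5))  # 15 fit
```

Same cluster, same free capacity, yet the “headroom” is 10 in one case and 15 in the other, purely because of the VM mix.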
So, what does the headroom question mean? Nothing. The headroom is a bogus, misleading number. This leads to the second question.
What do you do with this data?
I have always believed that dashboards and reports exist to serve a purpose, and that if someone looks at a report, it is because he is trying to solve a problem. So, in all my conversations with our customers, I try to figure out the purpose of the headroom dashboard: what do they do with the headroom information?
The answers I hear may sound different, but when I think about them, they all boil down to three basic decisions we are continuously trying to make:
- Where should we place the workload?
- Do we have enough capacity to accommodate the workload?
- What and when should additional capacity be purchased?
Given that the headroom is a bogus number, is this the right data to use to make these decisions? And if you have dozens of clusters, thousands of VMs, workload demand that continuously fluctuates, and every day you have to place tens of VMs, is a dashboard of all your clusters with their “headroom” what you want to use to make the decision of where to place these VMs?
There must be a better way. Wouldn’t it be better if, instead of looking at headroom numbers and plugging them into spreadsheets to try to figure out the answers to these questions, software provided the answers? Instead of asking for headroom, why shouldn’t the software tell us where to place the workload, or better yet, actually place the workload for us? And while we’re at it, shouldn’t the software also tell us what capacity to add to the environment, and when?
Why do our customers like this dashboard?
Given the answers to the first two questions, this one baffles me the most. Let me offer a few possible answers.
First, we like data. Data gives us the illusion of control. If we have the data, we know what’s going on and if we know what’s going on we are in control.
Second, we like to make decisions and we like to support our decisions with data, even if the data is bogus.
Third, we didn’t have a choice. In a world where all we had were spreadsheets or glorified spreadsheet tools with fancy user interfaces, we had no ability to continuously understand the actual demand or the real consumption and relate it to the infrastructure supply. All we could do was use static, allocation-based guesstimates to produce bogus numbers, plug them in, and feel better about the guesses we were making.
Fourth, inertia. We got used to using this data and no one challenged us!
Let Me Finish…
Let me finish by going back to the second question.
It is time to stop using bogus data to make these decisions.
It is time to stop staring at data and guessing the answers. The world we operate in is too complex and too dynamic.
Answering the questions of where to place workloads and when and what capacity to purchase requires real-time, simultaneous analysis of many continuously fluctuating dimensions. This is beyond human scale. We “got away” with answering these questions using headroom analysis based on static allocation in (glorified) spreadsheets. That was merely inaccurate and wasteful; now, with rising infrastructure utilization, it is operationally dangerous. It must be done by software.
Furthermore, demand now includes not only traditional, monolithic, relatively long-lived applications, but also VDI and containers with different consumption traits and expected lifetimes, and supply extends to on-demand public clouds, where headroom has no meaning because the available supply is effectively infinite. How can we possibly continue to make decisions based on these bogus numbers?
We can do better. We can move forward and retire the glorified spreadsheet tools. We can move from static, allocation-based placement and capacity decisions that are disconnected from reality to real-time, consumption-based decisions driven by the real workload demand at all times. We need to let software drive these decisions and keep the environment in a desired state.
We have a choice!