A recent blog post from VMware’s CTO office discusses “DRS Pairwise Balancing,” an allegedly improved version of DRS in vSphere 6.5 and later. “This new feature,” the blog proclaims, “is needed as clusters keep on growing larger and larger.” It turns out that some statistical measures used by the old DRS became “statistical outliers” in larger clusters, which, according to VMware, “simply disappear as noise due to the vast number of hosts that experience far lower utilization” and thus fall “below the threshold required to trigger load balancing.”
In other words, old DRS does not scale. Enter the new DRS: DRS Pairwise Balancing. “By adding the functionality of pairwise balancing, and ‘simply’ comparing the highest reported utilization with the lowest utilization, these outliers,” maintains VMware, “might be a thing of the past.” Might.
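The pairwise idea, as the blog describes it, can be sketched in a few lines. The function name, the threshold, and the utilization figures below are hypothetical illustrations of the described logic, not VMware’s actual implementation:

```python
# Hypothetical sketch of "pairwise balancing" as described: flag an
# imbalance when the gap between the most- and least-utilized hosts
# exceeds a threshold. (Illustrative only; not VMware's code.)

def pairwise_imbalance(host_utilizations, threshold=0.2):
    """Return True if the max/min host utilization gap exceeds the threshold."""
    highest = max(host_utilizations)
    lowest = min(host_utilizations)
    return (highest - lowest) > threshold

# One hot host among 63 idle ones: the pair comparison catches it,
# even though cluster-wide averages would bury it as "noise."
cluster = [0.95] + [0.10] * 63
print(pairwise_imbalance(cluster))  # True
```

The sketch also makes the limitation obvious: the decision rests on two data points for one metric, with no view of what the workloads on those hosts actually need.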
Today’s datacenters, on-prem as well as in the public cloud, are exceptionally complex environments. They are anything but simple. A medium-sized estate (5,000 VMs, not large by any means) involves trillions of tradeoffs to calculate. Here is an illustration that shows some of the tradeoffs in a standard virtual datacenter:
Any solution that doesn’t consider all the above tradeoffs continuously and simultaneously will never be able to solve for today’s applications.
When you contemplate all of the tradeoffs necessary to consider, as illustrated above, the idea of “‘simply’ comparing the highest reported utilization with the lowest utilization” of CPU and memory feels almost ridiculous. There’s nothing “simple” about the scale of such tradeoffs in a medium-sized environment like the one represented above, let alone one at hyperscale.
What’s perhaps more surprising is the conditional nature of this statement – “might be”. You’d think the folks who created the hypervisor might be able to provide a solution with a bit more certainty than that.
While this might surprise some, it didn’t surprise us at all.
In fact, what would be surprising is if this “DRS pairwise balancing” functionality actually solves the problem.
For you see, the challenge in today’s hyperscale virtualized world is that its problems can’t be solved with thresholds. Resource allocation at hyperscale is a multivariate problem that must be solved by considering all variables simultaneously.
Stepping back for a moment, what exactly is the solution? How do we make sure across the entire estate, all applications constantly and continuously get the resources they require to perform when they need them? Well, unlike this perspective, which seeks to balance hosts in an ever-growing cluster, what our customers really care about is simpler: they want – actually, they need – their applications to perform.
Assuring application performance should be the goal because this is the only thing that matters – companies, after all, don’t purchase expensive server hardware and virtualization software (or, for that matter, cloud instances) for any other reason than delivering a service that customers will pay for. And if that service doesn’t perform, then the customer receives no value, and will likely stop paying quite shortly thereafter.
Let’s consider this thought experiment: If a vCPU threshold is at 90%, how many more vCPUs should you add? Did you check the host to ensure there is enough CPU available? If you give that VM more vCPUs, does that lead to CPU Ready queuing, impacting the other VMs on the host? If the resources aren’t available, which VMs do you move, and how does that impact the target host you’re moving them to?
Now this just considers one variable, CPU – what about the other resources that an application needs to perform, such as memory, storage or network resources? Simple threshold-based metrics fail because they are looking at one or two variables in isolation from all the resources necessary to assure application performance.
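The thought experiment above can be made concrete with a toy check. Everything here is a hypothetical illustration (the function name, the 4:1 overcommit limit, and the host figures are assumptions, not real sizing guidance): acting on a single VM-level metric can silently breach a host-level constraint that affects every other VM on the host.

```python
# Toy illustration of why a single-metric threshold misleads: giving
# one VM more vCPUs can push the host's vCPU:pCPU overcommit ratio
# past an acceptable limit, creating CPU Ready waits for all its VMs.
# (Hypothetical numbers and limit; not a sizing recommendation.)

def can_add_vcpus(host_pcpus, vcpu_allocations, extra, max_overcommit=4.0):
    """Check whether adding `extra` vCPUs keeps host overcommit acceptable."""
    total_vcpus = sum(vcpu_allocations) + extra
    return total_vcpus / host_pcpus <= max_overcommit

# A 32-pCPU host already carrying sixteen 8-vCPU VMs sits exactly at
# 4:1 overcommit; adding 8 more vCPUs would cross the limit.
print(can_add_vcpus(32, [8] * 16, 0))  # True
print(can_add_vcpus(32, [8] * 16, 8))  # False
```

And this still only checks one resource on one host; the real decision also involves memory, storage, network, and the ripple effects of any migration.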
And the thresholds themselves? They are nothing more than a guess. Why pick 90%? Why not 95%? Or 80%, for that matter?
Simple host-based thresholds, however, are just the first point of failure in our hyperscale virtualized world. A more fundamental failure results from the perspective – here, the focus is on the hosts in the cluster, but as noted above no customer spends money on hosts with the goal of balancing them. They spend hard-earned company capital on what their customers care about: Applications that perform when they are needed.
By focusing on the hardware and trying to solve from that perspective, this approach fails to take into consideration the application itself. Is it available? If available, is it responsive? If it’s responsive, does the response time align with customer expectations? None of these questions are answered by focusing simply on host-based thresholds.
But there is perhaps one even more important, more fundamental point of failure in this host-focused, threshold-based approach:
It fails because it assumes failure in the first place.
The problem with the threshold-based approach is that by waiting for some arbitrary threshold to be breached, it assumes failure is a given. It assumes that we must wait until something goes wrong before we can do something about it.
VMware can perhaps be forgiven for this – after all, this assumption of failure underlies all monitoring tools ever developed. By its nature, monitoring of any type – network, hardware, or application performance – is designed to spot abnormalities as quickly as possible so humans can be alerted and get to work on troubleshooting and root cause analysis.
That’s the problem with monitoring: abnormalities are often transient, temporary incidents that come and go before a human is ever able to figure out the root cause and remediate. The one aspect of this scenario that isn’t transient or temporary, however, is the impact on customers attempting to use your service: Their experience is at best sub-optimal, and their memory of that experience may well linger long after the incident has abated.
Which is why we created Turbonomic a decade ago. We understood that to assure application performance you had to focus on the application, taking into account not only the demand on the application but all the various resources necessary along multiple layers of the IT stack (the “supply chain”) for the application to perform to customer expectations.
In order to do so, Turbonomic represents holistically any environment as a supply chain of consumers and providers of resources working together to meet application demand.
Essentially, Turbonomic’s model creates the “perfect market” within every environment Turbonomic manages – perfect in the market-theory sense that the buyers and sellers are rational actors, with information symmetry between them at all times (both see the exact prices of all resources at the same time, with full transparency).
What’s better is that this model is perfectly adaptable to innovation – innovations like SDN, containers, public cloud, cloud-native technologies are all welcome in our model because their existence increases market liquidity. As in financial markets, the larger and more complex the market – the more actors buying and selling within that market – the greater the liquidity, which enables a more stable equilibrium, and thus, a more efficient, functioning market.
More fundamentally, however, when we created Turbonomic we made a bet on turning the traditional monitoring approach entirely on its head. Whereas traditional monitoring has always focused on the moving target of finding failure fast so humans can be called in to resolve it, we choose to focus on a different moving target: health.
Rather than allowing an application to become sick, losing performance and perhaps failing entirely, we chose an approach that sought out the elusive balance between application demand and infrastructure supply – Application Resource Management or “ARM” – an equilibrium that would keep an application healthy and performant so customers could delight in your beautifully designed service for as long as they choose.
Seeking this “desired state” assuring application performance is a far more productive approach than allowing things to break and then trying to figure out how to fix them. It’s also more customer-friendly in that we don’t expose them to failure in the first place.
The only way to do this at any kind of scale, however, is to empower the system to discover this equilibrium itself – so that it becomes self-managing.
By empowering “buyers” (entities like applications, VMs, instances, containers, services, etc.) with a budget to seek the resources needed to deliver whatever service they’re designed to deliver, and by empowering “sellers” to price their available resources (resources like CPU, memory, storage, network) based on utilization in real time, Turbonomic enables well-managed, well-run operations: When demand on a specific app rises, prices of the resources that app shares with other buyers rise with it, forcing workloads to move to cheaper, less-contended resources.
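The mechanism can be sketched in miniature. This is a minimal, hypothetical sketch of utilization-based pricing in general (the function names and the pricing curve are assumptions for illustration), not Turbonomic’s actual engine:

```python
# Minimal sketch of utilization-based pricing: each host "sells" CPU
# at a price that rises steeply as it approaches saturation, and a
# workload "buys" from the cheapest host, so demand naturally flows
# away from busy hosts. (Illustrative only; not Turbonomic's engine.)

def price(utilization):
    """Price rises sharply as a resource approaches saturation."""
    return 1.0 / (1.0 - min(utilization, 0.99))

def place(workload_demand, host_utilizations):
    """Place a workload on the host currently offering the lowest price."""
    cheapest = min(range(len(host_utilizations)),
                   key=lambda i: price(host_utilizations[i]))
    host_utilizations[cheapest] += workload_demand
    return cheapest

hosts = [0.90, 0.40, 0.60]
print(place(0.10, hosts))  # 1 -> the least-utilized host is cheapest
```

Note that no threshold appears anywhere: placement emerges from relative prices, which change continuously with demand, rather than from a fixed trigger waiting to be breached.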
At the scale of even today’s small datacenters, there is no way for humans to consider all the tradeoffs necessary to assure performance in time to deliver performance. And any solution that considers a few “simple” metrics, reactively, and assumes failure as a prerequisite…
Well, that solution is no solution at all.
So, beware any technical-sounding approach that promises things “might be” better by “‘simply’ comparing the highest reported utilization with the lowest utilization.” Considering everything this approach fails to consider, “DRS” might actually stand for Doesn’t Really Scale.