Whoa, that was fast. 80%+ virtualization levels hit us like bell-bottoms in the ’60s. In our conversations with operations teams, the dialogue is changing. It’s no longer about P2V migration. It’s about maintaining control as workloads multiply: How do I manage VM sprawl?
IT ops is challenged by “the business” to provide exceptional quality of service to support a host of new applications – at an appropriate spend level. But what happens when application owners' eyes are bigger than their stomachs? One senior admin shared with us, “as far as the business is concerned VMs are free.”
Want to manage VM sprawl? Best arm yourself with showback data. Here’s the play-by-play:
1. Mr Business, we’re investing in new platforms to efficiently support you:
You’ve invested in new storage platforms such as VNX 5700 & 5500 and EqualLogic 1600, and are applying thin provisioning to make the most of that investment. And you’ve purchased monitoring tools like vCOPS or Foglight to get visibility into production, support, and development environments. But those resources and that investment are wasted when the business asks you to over-allocate.
2. Meanwhile, you’re asking for twice as many resources [CPU, memory, IOPS] as the app needs
This challenge is further exacerbated by the resource requests themselves. The same admin elaborated, “The business is asking for a ridiculous amount of resources which may not be necessary. Some of these vendors that the business works with say, ‘add more memory,’ ‘add more storage,’ but without a real understanding of what is being used.”
We find that packaged applications with vendor-dictated specs are the worst offenders, as these requirements are often based on physical server deployments. Why vendors can't update their specs for the modern age (see first sentence of this blog) is beyond our comprehension, too.
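3. Performance and efficiency are better if we correctly allocate resources
Over-allocating can often do more harm than good: a VM with more vCPUs than it needs spends more time waiting to be scheduled on physical cores, which shows up as high CPU ready metrics.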
Let's go down this path. If you're like our customers, the IT department has become a "yes man" culture. The cultural push is to provide the business with the requested resources. We find in our conversations that there is often no mechanism to push back, but also no budget to scale dramatically. VM sprawl is inevitable. Also inevitable are performance issues and inefficiently utilized hardware.
The degradation in performance may be due to LUN latency, workloads waiting around for CPUs, a host losing its path to the data source, or a plethora of other potential root causes. Isolating the true root cause is a significant challenge. For example, a typical cause of high %wait (excluding %idle) is a poorly performing storage device where the VM's files reside. In that case, examine the storage latencies for the VMs in question, and check whether other VMs sharing the same LUN or array are also experiencing high storage latency.
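To make that last check concrete, here is a minimal Python sketch, not anything from a particular tool. It assumes you have already exported per-VM average storage latency along with the LUN each VM lives on; the `triage_storage_latency` function, the sample VM and LUN names, and the 20 ms threshold are all hypothetical and should be replaced with whatever fits your environment.

```python
from collections import defaultdict

# Hypothetical threshold: average latency above this (in ms) counts as "high".
HIGH_LATENCY_MS = 20.0


def triage_storage_latency(samples, threshold_ms=HIGH_LATENCY_MS):
    """Label each slow VM by whether its LUN neighbors are slow too.

    samples: iterable of (vm_name, lun_name, avg_latency_ms) tuples.
    Returns {vm_name: "shared" | "isolated"} for VMs above the threshold.
    """
    by_lun = defaultdict(list)
    for vm, lun, latency in samples:
        by_lun[lun].append((vm, latency))

    findings = {}
    for lun, vms in by_lun.items():
        slow = [vm for vm, latency in vms if latency > threshold_ms]
        for vm in slow:
            # More than one slow VM on the same LUN points at the LUN or array;
            # a single slow VM points at something specific to that VM.
            findings[vm] = "shared" if len(slow) > 1 else "isolated"
    return findings


# Made-up sample data for illustration only.
samples = [
    ("app01", "lun-a", 35.0),
    ("app02", "lun-a", 41.5),
    ("db01", "lun-b", 28.0),
    ("web01", "lun-b", 4.2),
]
result = triage_storage_latency(samples)
print(result)
```

A "shared" result suggests looking at the storage device first; an "isolated" result suggests the problem is more likely inside the VM or its own workload.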
4. Here’s what your application needs - can we rightsize it?
So how do you regain control? Implement that feedback loop. Arm yourself with showback data that shows which VMs are over-provisioned or dormant. This data must be easy for you to produce (who wants to spend their mornings on reporting?), and easy for the business stakeholders or application owners to understand.
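If you want a feel for what that showback data could look like before buying anything, here is a rough Python sketch. It assumes you can export peak CPU and memory utilization per VM over a reporting window; the `VmUsage` record, the `classify` and `showback_report` helpers, and both thresholds are hypothetical choices for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    name: str
    owner: str           # business unit or application owner
    vcpus: int
    mem_gb: float
    peak_cpu_pct: float  # peak CPU utilization over the reporting window
    peak_mem_pct: float  # peak memory utilization over the reporting window


# Hypothetical thresholds; tune them to your own environment.
DORMANT_PCT = 5.0
OVERSIZED_PCT = 40.0


def classify(vm):
    """Return "dormant", "over-provisioned", or "ok" for one VM."""
    peak = max(vm.peak_cpu_pct, vm.peak_mem_pct)
    if peak < DORMANT_PCT:
        return "dormant"
    if peak < OVERSIZED_PCT:
        return "over-provisioned"
    return "ok"


def showback_report(vms):
    """List (owner, vm, status) for every VM that is not "ok", sorted by owner."""
    rows = [(vm.owner, vm.name, classify(vm)) for vm in vms]
    return sorted(row for row in rows if row[2] != "ok")


# Made-up fleet for illustration only.
fleet = [
    VmUsage("crm-app", "sales", 8, 32, 22.0, 30.0),
    VmUsage("old-test", "dev", 2, 4, 1.0, 3.0),
    VmUsage("erp-db", "finance", 16, 64, 78.0, 85.0),
]
report = showback_report(fleet)
for owner, name, status in report:
    print(f"{owner:10} {name:10} {status}")
```

Grouping by owner is deliberate: the point of showback is to hand each application owner a short list of their own VMs, not a fleet-wide dump.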
We think the dashboards in VMTurbo are good for this purpose (but we're biased). You can generate regular reports on CPU, memory, IOPS, latency and storage utilization for specific apps or groups of apps. This will not only keep the business in check but also give you ammunition when you know you are running out of resources.
VMTurbo's Operations Manager gathers a holistic set of metrics from the underlying hypervisor covering CPU, memory, storage and network resources. Through a broad range of resource allocation decisions, including VM placement and sizing actions, VMTurbo resolves existing performance issues and prevents new ones from occurring, while maximizing the utilization of underlying physical IT assets.