You may have already seen the Seinfeld episode in which Jerry and Elaine try to pick up a rental car. It could be one of the funniest and most poignant examples of the failed concept of reservations.
Agent: I'm sorry, we have no mid-size available at the moment.
Jerry: I don't understand, I made a reservation, do you have my reservation?
Agent: Yes, we do, unfortunately we ran out of cars.
Jerry: But the reservation keeps the car here. That's why you have the reservation.
Agent: I know why we have reservations.
Jerry: I don't think you do. If you did, I'd have a car. See, you know how to take the reservation, you just don't know how to *hold* the reservation and that's really the most important part of the reservation, the holding. Anybody can just take them.
We laugh at this situation mostly because we know it happens in real life. It may have even happened to us. But what about the very real version of it that may be happening right now in your data center without you realizing it?
But I Have a Reservation?
Just recently, a Twitter conversation came up about the challenges of workloads that are performing poorly.
We’ve all seen this before. A virtual machine starts to underperform, so the natural reaction is for someone on the admin team to add reservations to the CPU, memory, or storage shares with the idea that this creates a “better” situation. There is a fundamental problem with what we’ve just done, though, as I highlighted recently in my article about Ready Queue.
Not only do we have to worry about how we ourselves are managing resources dynamically in the data center, but now we have all of these manually-created constraints on VMs, created by any of a number of people on the team.
The core of the issue we have is that, by sawing the legs off of the table, we are narrowly focused on the single thing we are attacking – with no view of the other things impacted by the change we are making. In the case of adding reservations, not only are we not fixing the issue, but we are creating issues for other workloads in the environment.
Let’s walk through a common scenario where this happens.
A Story of 24 Virtual Machines
Imagine you are running a VMware vSphere environment with 12 virtual machines on each of 2 physical hosts. Each physical host has two 6-core 1.9 GHz processors and 96 GB of memory, and the hosts share an NFS datastore with a total of 10 TB of usable space on SAS drives.
Divided equally, the resources available to the 24 VMs look like this:
vCPU cores: 48 total logical cores (the 24 physical cores, assuming hyper-threading) / 24 VMs = 2 vCPU cores per VM
vMemory: 192 GB total / 24 VMs = 8 GB per VM
Storage Allocation: 10 TB total / 24 VMs = approx. 416 GB per VM
These are the physical numbers. As most people do when given the option of thin provisioning and shared virtual hardware, we overprovision. With our 24 virtual machines, it is entirely possible that owners will request VMs with these allocations:
Memory: 16 GB
Storage: 1 TB
Using these figures, our total virtual allocation is now at least twice the physical capacity of the underlying hardware. This is all well and good in theory, on the assumption that the VMs will never need all of their resources at the same time. In the same way that telephone networks use time slicing, we assume that the back and forth between the VM and the host will let the host schedule out the resources as needed and still satisfy the needs of the application.
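To make the arithmetic concrete, here is a minimal sketch in Python that works out the equal share per VM and the resulting overcommit ratio for memory and storage. The function names are made up for illustration, and 10 TB is treated as 10,000 GB to match the approximate figure above.

```python
# Physical capacity across both hosts, taken from the scenario above.
TOTAL_MEMORY_GB = 192
TOTAL_STORAGE_GB = 10_000  # assumption: 10 TB counted as 10,000 GB
VM_COUNT = 24

# What each VM owner asked for.
REQUESTED_MEMORY_GB = 16
REQUESTED_STORAGE_GB = 1_000  # assumption: 1 TB counted as 1,000 GB


def fair_share(total, vm_count):
    """Equal slice of a physical resource per VM."""
    return total / vm_count


def overcommit_ratio(requested_per_vm, vm_count, total):
    """Allocated virtual capacity divided by physical capacity."""
    return (requested_per_vm * vm_count) / total


memory_share = fair_share(TOTAL_MEMORY_GB, VM_COUNT)
storage_share = fair_share(TOTAL_STORAGE_GB, VM_COUNT)
memory_ratio = overcommit_ratio(REQUESTED_MEMORY_GB, VM_COUNT, TOTAL_MEMORY_GB)
storage_ratio = overcommit_ratio(REQUESTED_STORAGE_GB, VM_COUNT, TOTAL_STORAGE_GB)

print(f"Memory:  {memory_share:.1f} GB fair share, {memory_ratio:.1f}x overcommitted")
print(f"Storage: {storage_share:.1f} GB fair share, {storage_ratio:.1f}x overcommitted")
```

A ratio above 1.0 is not a problem in itself. It only becomes one when enough VMs ask for their full allocation at the same moment.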
Then reality sets in.
Four months later, the workloads are more heavily utilized. We figured out that the SQL server needs more memory, so we reserved 14 GB of the 16 GB assigned to it. There is also a web server that needs lots of RAM, so we reserved 12 of its 16 GB. One of the testing environments does some heavy data processing, so that VM gets a reservation for its full CPU allocation.
We are already creating “solutions” to the perceived issue: that performance is not being fairly divided between the virtual machines. As the natural, organic increase in workload described above happens, the situation worsens. Worse than the rising utilization itself, we now also have individuals on the virtualization admin team each adding reservations along the way to try to give more resources to individual workloads. By failing to understand the environment holistically, and failing to appreciate the way individual application demand impacts infrastructure supply in real time, we begin to build problems with these manually-created reservations.
Slowly but surely such reservations grow to the point where we can no longer even add reservations because the physical infrastructure is not able to keep up. You’ll find that HA and DRS no longer work well because the resource constraints applied to the guests are impacting the ability to migrate workloads. If a host failure occurs, you’re going to find out the hard way that we have been ignoring the real problem all along by hiding behind growing reservations and limits.
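As a rough illustration of how this creeps up on you, the sketch below adds up memory reservations and checks whether they would still fit if one of the two hosts failed. It is a deliberately simplified model, not how vSphere HA admission control actually computes capacity: it ignores hypervisor and per-VM overhead as well as placement, and the "later" set of reservations is invented for the example.

```python
HOST_MEMORY_GB = 96
HOST_COUNT = 2


def survives_host_failure(reservations_gb, host_memory_gb=HOST_MEMORY_GB,
                          host_count=HOST_COUNT, failed_hosts=1):
    """Return True if all reservations still fit on the surviving hosts.

    Simplified model: pools the surviving hosts' memory and ignores
    hypervisor overhead, per-VM overhead, and placement constraints.
    """
    surviving_capacity = host_memory_gb * (host_count - failed_hosts)
    return sum(reservations_gb.values()) <= surviving_capacity


# Month four: only the two memory reservations from the story.
month_four = {"sql-server": 14, "web-server": 12}

# Hypothetical later state: admins have added 10 GB reservations to
# eight more VMs, one ticket at a time.
later = dict(month_four)
later.update({f"vm-{i:02d}": 10 for i in range(1, 9)})

for label, reservations in (("month four", month_four), ("later", later)):
    total = sum(reservations.values())
    ok = survives_host_failure(reservations)
    print(f"{label}: {total} GB reserved, fits on one host: {ok}")
```

No single reservation in the second set looks unreasonable on its own. It is the sum, which nobody on the team is tracking, that no longer fits on the surviving host.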
This hasn’t even touched on the storage. Just imagine what happens when these heavily over-allocated systems do nightly virus scans and backups. You can try to offset them from batch processing windows, but the reality is that you can’t know the right window to work with, and even if you do, that window is dynamic. Many environments find that they are actually backing up servers during the business day because of the overrun of the schedule due to the constraint of time and resources.
This is a story that happens every day. It’s meant as a lesson to show you that the solution that creates another constraint is not a solution at all.
So, just like when Jerry was told that he had a reservation, it seems that reserving the car didn’t actually guarantee the car would be there.
Image source: From Seinfeld's Classic episode "The Alternate Side", season 3, ep. 11 found at http://www.amazon.com/Seinfeld-Season-3-Jerry/dp/B0002UE1WQ