Ready queue sucks! There is no way around it, and if you are a VMware shop you are pretty much stuck managing it until either they change something or you move to another technology. As my colleague Joe mentioned in a previous post, waiting in the CPU ready queue line can often seem like an eternity (at least for the thread).
First of all, what is ready queue?
“CPU Ready = % of time there is work to be done for VMs, but no physical CPU available to do it on (all host CPUs are busy serving other VMs).”
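To put a number on that percentage: vCenter's performance charts report CPU ready as a summation in milliseconds per sample interval, not as a percentage, so the value has to be converted. Here is a minimal sketch of that conversion, assuming the 20-second sample interval of the real-time charts. The function name and the sample value are made up for illustration.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_s defaults to 20 seconds, the assumed real-time chart interval;
    pass a different value for rolled-up (daily, weekly, ...) charts.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0)


if __name__ == "__main__":
    # Hypothetical sample: 1000 ms of ready time within one 20 s interval.
    print(f"{cpu_ready_percent(1000):.1f}%")  # prints 5.0%
```

Note that the value is usually read per vCPU, so a summation reported for a whole multi-vCPU VM needs dividing by the vCPU count before comparing it to any threshold.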
So why is this just a VMware problem?
Well, because VMware is the only major hypervisor that suffers from this. It goes back to the scheduler they used when first developing ESXi, which dates from the old Unix virtualization days. Other major hypervisors do not suffer from this problem because they use a different scheduler that lets a VM access the physical resources it needs asynchronously. But obviously they're not perfect either.
If you’ve read this far you probably already know that it’s a problem and have probably come across the KB articles on CPU ready and what to do about it.
What makes this a problem?
Well, I don’t know about you, but the sys engineer in me does not want my VMs waiting to process while there are free cycles on cores. Yet because the right number of cores isn’t free at the same time, the VM has to wait.
To look at it differently: you are at a restaurant as a party of 4, and there are only 16 seats in the whole restaurant. You come in and tell the host/hostess that there are 4 of you and you want to sit together, and she promptly informs you that there are not 4 seats available, so you’ll have to wait. While you are waiting, 3 people get up. You can’t sit down because there are only 3 seats, but the couple that just came in can take 2 of them since that is all they need. This keeps happening: 2 get up, 1 sits down, 3 get up, 2 sit down.
The whole time, you are waiting for those 4 seats to open up while watching everyone else eat and enjoy their food.
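If you would rather see the restaurant in code, here is a toy sketch of the seating rule in the analogy: a party sits only when enough seats are free for the whole party at once. It models the analogy only, not the actual ESXi scheduler, and the function name, event format, and party names are all invented for illustration.

```python
def seat_parties(free_seats, events):
    """Replay arrivals and departures under an all-or-nothing seating rule.

    events is a list of ("arrive", name, size) or ("leave", seats_freed).
    Returns (seated, waiting): seated is a list of (step, name) in the
    order parties sat down; waiting is whoever never got a table.
    """
    waiting = []
    seated = []
    for step, (kind, *args) in enumerate(events):
        if kind == "arrive":
            name, size = args
            waiting.append((name, size))
        elif kind == "leave":
            (seats_freed,) = args
            free_seats += seats_freed
        # Seat every waiting party that fits entirely; smaller parties
        # can slip past a larger one that is still waiting.
        still_waiting = []
        for name, size in waiting:
            if size <= free_seats:
                free_seats -= size
                seated.append((step, name))
            else:
                still_waiting.append((name, size))
        waiting = still_waiting
    return seated, waiting


if __name__ == "__main__":
    events = [
        ("arrive", "party of 4", 4),  # restaurant is full, so they wait
        ("leave", 3),                 # 3 seats free: not enough for 4
        ("arrive", "couple", 2),      # couple fits, 1 seat left
        ("leave", 2),                 # 3 seats free again
        ("arrive", "solo", 1),        # solo diner fits, 2 seats left
        ("leave", 2),                 # 4 seats free at last
    ]
    seated, waiting = seat_parties(free_seats=0, events=events)
    for step, name in seated:
        print(f"step {step}: {name} sits down")
```

In this run seven seats open up in total, yet the party of 4 is the last to sit, because four seats are never free at the same moment until the final step.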
How do we fix this problem?
Well, there is the obvious choice: don’t use VMware. For most people this is not an option; if you are already heavily invested in the tech, you are probably not going to move away from it unless you have to.
Or you can find a tool that looks for ready queue and gives you alerts designed to help you “fix it.” This is an even worse choice, as greater visibility or alerts won't solve the problem.
But what about giving VMTurbo’s unique demand-driven control system a try? VMTurbo understands the real-time workload demand and meets it with your compute, storage and network infrastructure supply.
It gives you the recommended actions that assure congestion never happens, whether it’s moving a VM to another host or actually reducing the number of vCPUs in a VM to decrease the ready queue at that vCPU count. If you are one of the many VMTurbo users, then you know what I’m talking about. Most of the time these actions are already automated in your environment, because calculating fixes to ready queue is math that no one wants to do or has time to do, and taking into consideration all of the decisions and consequences of every action is an impossible task, even for a superhuman.
The last way to help address CPU ready queue is prevention through education.
*This is my PSA on CPU ready queue and rightsizing in general*
By working closely with your application owners and developers, you can not only educate them and show them that in a virtual environment more is not necessarily better, but also be a voice in the room when things like VM specifications are being planned.
I cannot count how many times project managers or developers have come to me with VM requirements and I literally laughed out loud because of the ridiculous amounts of resources they wanted for their servers. For instance, I had a developer ask me for 32GB of memory on a server running 32-bit Java. For those of you who didn’t see the hilarity in that: a 32-bit application can only use up to roughly 4GB of memory. (Yes, I know that has nothing to do with CPU ready, but you get my point.)
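For anyone who wants the arithmetic behind that "roughly 4GB": a 32-bit pointer can name 2^32 distinct byte addresses, and that is the ceiling. A quick sketch, with a helper name made up for illustration:

```python
def max_addressable_gib(pointer_bits: int) -> float:
    """Upper bound on byte-addressable memory for a given pointer width, in GiB."""
    return 2 ** pointer_bits / 2 ** 30


if __name__ == "__main__":
    print(max_addressable_gib(32))  # prints 4.0
```

In practice the usable heap of a 32-bit JVM is typically lower still, since that address space is shared with the runtime's own code and data. Either way, the other 28GB in that request would have sat idle.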
So the key here is to be involved in the process as much as possible, so that by the time the request gets to you it makes sense and is something you do not have a problem deploying into your environment.