We keep hearing about OpenStack deployments lingering in the background of our customers’ overall go-forward strategy. I think to myself: who can blame them? The allure of OpenStack (aside from being free) revolves around a modular approach to project development and an open source community coming together for the common goal of saving the world from expensive licensing costs and vendor lock-in. It’s a great pie-in-the-sky vision that offers the innovators of the virtualization age the opportunity to create a truly flexible Infrastructure-as-a-Service (IaaS) platform.
But with all of the pearly white beaches that OpenStack offers, it seems that a number of our customers toying with the platform haven’t fully plunged into an enterprise infrastructure powered solely by OpenStack just yet.
I guess the old adage remains: if you have money, you buy VMware… If you have time and people, you should build on OpenStack, right? Let’s take a look at why this is a common thought process.
The Challenge of Manageability in OpenStack
The challenge is that the latter approach, building on OpenStack, introduces manageability concerns once you move it into production. Fear of underdeveloped instrumentation, quality of service violations, user error, organizational adoption, and questionable vendor support all come into play. I have personally been chewing on one specific use case that I believe organizations will inevitably need to master in order to accelerate OpenStack’s adoption: intelligently deploying and live migrating workloads within an OpenStack environment.
We all know that vMotion remains the linchpin of our flexibility and agility on platforms like VMware, but accomplishing its live migration equivalent on OpenStack will be a very different animal. With our hopes of flexibility and elasticity on OpenStack, the ability to manage this animal intelligently will inevitably make or break the success of a full IaaS implementation.
Let’s work through some of the challenges associated with meeting workload demands in OpenStack through initial and ongoing placement. For now, let’s assume that I am using an OpenStack/KVM environment with NFS shared storage for live migration, and that I have already configured my NFS server on the controller node and mounted the NFS share on all the compute nodes for fully shared storage.
Let’s also assume that I have followed the appropriate steps for Nova scheduler to pick up my instance requests and launch new instances into the environment for initial deployment. Nova-compute acts as the brain responsible for executing changes, which may include provisioning and managing workloads in real time, while Nova-scheduler is responsible for picking up instance requests and determining which compute node each instance should be deployed to. Instrumentation and logging of the activity happen in Ceilometer, which gathers performance and monitoring metrics for the purpose of presenting this information to a user or administrator.
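To make the scheduler's role concrete, here is a minimal sketch of a filter-then-weigh placement decision. It is an illustration only, not Nova's actual code: the `Host` and `Request` classes, the host names, and the "most free RAM wins" weight are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    # Hypothetical view of a compute node's spare capacity.
    name: str
    free_vcpus: int
    free_ram_mb: int


@dataclass
class Request:
    # Hypothetical instance request (roughly what a flavor describes).
    vcpus: int
    ram_mb: int


def filter_hosts(hosts: List[Host], req: Request) -> List[Host]:
    """Keep only hosts with enough free vCPUs and RAM for the request."""
    return [
        h for h in hosts
        if h.free_vcpus >= req.vcpus and h.free_ram_mb >= req.ram_mb
    ]


def weigh_host(host: Host) -> int:
    """Assumed weight for this sketch: more free RAM is better."""
    return host.free_ram_mb


def schedule(hosts: List[Host], req: Request) -> Optional[Host]:
    """Filter out hosts that cannot fit the request, then pick the best-weighted one."""
    candidates = filter_hosts(hosts, req)
    if not candidates:
        return None  # no valid host for this request
    return max(candidates, key=weigh_host)


hosts = [
    Host("compute-a", free_vcpus=2, free_ram_mb=4096),
    Host("compute-b", free_vcpus=8, free_ram_mb=16384),
    Host("compute-c", free_vcpus=16, free_ram_mb=8192),
]
chosen = schedule(hosts, Request(vcpus=4, ram_mb=8192))
print(chosen.name if chosen else "no valid host")
```

Note that the decision uses only a snapshot of free capacity at request time; nothing in this sketch revisits the choice once the instance is running.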
Figure 1: OpenStack Projects
Now that I am set up for live migration, let’s paint a real-life scenario where I will play the administrator and it’s time to control my workload.
My administrator instinct tells me to start with free monitoring solutions, so I put together a hodgepodge of CLI commands, scripts, and libvirt plug-ins that give me additional queried metrics on my VMs for CPU, memory, network, and storage above and beyond what Horizon shows. I even avail myself of OpenStack’s alarm capabilities and have configured alarm thresholds to notify me when my host resource utilization runs high. Here’s one for CPU on my host at 70%:
Figure 2: OpenStack CLI Command
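For readers who want to see the logic behind such a threshold alarm, here is a minimal sketch. It assumes the alarm fires only when the average CPU utilization stays above the threshold for several consecutive evaluation periods; the function name, the parameter names, and the choice of three periods are assumptions for illustration, not the actual alarm configuration shown in the figure.

```python
from typing import List


def alarm_state(period_averages: List[float],
                threshold: float = 70.0,
                evaluation_periods: int = 3) -> str:
    """Evaluate a simple 'greater than threshold' alarm.

    period_averages holds one average CPU percentage per period, oldest first.
    The alarm fires only if every one of the most recent
    `evaluation_periods` averages is above the threshold.
    """
    if len(period_averages) < evaluation_periods:
        return "insufficient data"
    recent = period_averages[-evaluation_periods:]
    return "alarm" if all(value > threshold for value in recent) else "ok"


# Hypothetical host CPU averages, one per period.
print(alarm_state([65.0, 72.0, 75.0, 81.0]))
print(alarm_state([72.0, 68.0, 75.0]))
print(alarm_state([90.0]))
```

Requiring several consecutive periods keeps a short spike from paging the administrator, but it also means the alarm only tells me about one metric on one host, after the fact.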
Well, we all know what happens with alarm thresholds… we cross them! DING. An alarm comes through notifying me that my host has gone past 70%, the place I didn’t want it to be. I confirm by checking the host, as suggested by OpenStack’s best practices:
Figure 3: OpenStack CLI Command
After comparing virtual machine CPU and memory metrics in the Ceilometer and Horizon dashboards, I determine that I need to move one of the bigger VMs, and I check the rest of the compute cluster to see where to move it:
Figure 4: OpenStack CLI Command
Before I click go on the live migration command, I check network metrics to make sure that my throughput is healthy prior to migration. I then green light it and let it fly.
I wait 10 minutes and then notice that another alarm has gone off. This time it is related to memory congestion. Fear settles in as I grapple with the decision that I just made: I failed to check the memory usage of the instance that I just migrated to Host C and ended up moving the problem elsewhere in my environment!
Sounds all too familiar, doesn’t it? It also sounds like something that we wouldn’t want to be doing manually at all.
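The mistake in this scenario can be sketched in a few lines. The example below compares a CPU-only destination choice with one that projects both CPU and memory after the move. All host names, numbers, and limits are made up for illustration, and the sketch assumes identically sized hosts so that CPU percentages can simply be added.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HostLoad:
    # Hypothetical current load on a candidate destination host.
    name: str
    cpu_pct: float
    mem_used_mb: int
    mem_total_mb: int


@dataclass
class Instance:
    # Hypothetical demand of the VM being migrated.
    name: str
    cpu_pct_of_host: float  # assumes all hosts have the same CPU capacity
    mem_used_mb: int


def projected(host: HostLoad, vm: Instance) -> Tuple[float, float]:
    """Return (cpu %, memory %) on the host if the VM were moved there."""
    cpu = host.cpu_pct + vm.cpu_pct_of_host
    mem = 100.0 * (host.mem_used_mb + vm.mem_used_mb) / host.mem_total_mb
    return cpu, mem


def pick_by_cpu_only(hosts: List[HostLoad]) -> HostLoad:
    """What the administrator did: choose the host with the lowest CPU."""
    return min(hosts, key=lambda h: h.cpu_pct)


def pick_multi(hosts: List[HostLoad], vm: Instance,
               cpu_limit: float = 70.0,
               mem_limit: float = 80.0) -> Optional[HostLoad]:
    """Choose a host that stays under both limits after the move."""
    fits = []
    for h in hosts:
        cpu, mem = projected(h, vm)
        if cpu <= cpu_limit and mem <= mem_limit:
            # Score by the tighter of the two constraints; lower is better.
            fits.append((max(cpu / cpu_limit, mem / mem_limit), h))
    if not fits:
        return None
    return min(fits, key=lambda pair: pair[0])[1]


hosts = [
    HostLoad("host-b", cpu_pct=40.0, mem_used_mb=20000, mem_total_mb=32768),
    HostLoad("host-c", cpu_pct=25.0, mem_used_mb=28000, mem_total_mb=32768),
]
vm = Instance("big-vm", cpu_pct_of_host=20.0, mem_used_mb=6000)

print(pick_by_cpu_only(hosts).name)
best = pick_multi(hosts, vm)
print(best.name if best else "no safe destination")
```

With these made-up numbers the CPU-only choice lands on host-c, where the VM's memory would not fit, while the two-metric check settles on host-b. Even this version ignores network and storage, which is the point: every metric left out is another chance to move the problem instead of solving it.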
It’s obvious that the process of live migrating a single virtual machine on a single constraint is complex, but what happens when we move to a full IaaS model that leverages Nova scheduler to orchestrate VM deployments and auto-scaling of instances for scaling groups? Nova scheduler might actually decide to place the workload on the exact same host that I had originally chosen for its CPU availability! How can I be confident that Nova scheduler is intelligent enough to determine initial placement when I literally just got burned myself? Placement is only a point-in-time decision, but the lifecycle of that instance means it should be continuously revisited.
This is where the idea of “load balancing” compute nodes becomes irrelevant. Even in the use case above, where my threshold was set to 70% on CPU, would I really be assuring performance if I changed this to 50%? Or 45% for that matter? I could have migrated my VMs proactively to keep the host beneath this level, or even created a script to do so for me, but in either case I still wouldn’t know whether workload demand is being met by the supply underneath.
More importantly, there is no way to ensure that we are dealing with all of the performance trade-offs across every variable in real time, or with the consequences of dissimilar workloads interfering with one another in my environment. The only certainty that I have is that my gut tells me that higher utilization creates greater risk to quality of service. The result is that many organizations may find themselves forced to provision excess capacity to try to mask the operational risk associated with driving higher commit levels inside OpenStack.
This will, in turn, adversely affect the very cost-reduction initiatives that are pushing people in the direction of OpenStack in the first place. While it may not be an observed pain today, one of the core challenges facing OpenStack adoption in the coming months will boil down to this single problem:
Assuring application performance WHILE utilizing infrastructure as efficiently as possible.
It is realistic to expect that a human may be able to solve one or the other within reason, but expecting our operations teams to do the heavy lifting of both OpenStack implementation and service delivery will be a tall order.
We have just scratched the surface of just one of the actions that can be leveraged to control an OpenStack deployment. I will leave you with the following use cases to ponder in the same light:
- Sizing Virtual Machines to flavor specs in OpenStack
- Scale up/Scale out
- VM Deployment and self-service through Nova/Heat
- Cinder considerations for sizing and provisioning volumes
Just think about it: will this scale in OpenStack? After considering the complexity of doing each one individually, I encourage you to then think about doing all of them simultaneously while controlling QoS and efficiency across onboarding, planning, and real-time management. It’s time to change the way we control our infrastructure before it begins to control us.