There is a lot of conversation happening around DevOps across IT organizations. It was born out of the Agile movement and has started to permeate other areas. Many of these conversations are taking place in application design and implementation, trying to create the same kind of speed that public clouds like Azure or AWS offer.
Existing virtual platforms are feeling the effects of the demand for faster deployment times while maintaining the level of performance the business requires. That is why we need to address some of the challenges virtual infrastructure teams face when they try to apply DevOps methodologies to their own environments.
It’s all about the flow ….
The core of DevOps is increasing the flow of delivery. For applications, it is how fast you can move application bits from a developer’s desktop to production. For the virtual and cloud environment, it is how fast you can take a request for compute resources and deliver those resources back. There are many schools of thought on how this can be achieved, and we are not going to dissect them all here. The main thing to look at is not the technical implementation of the workflow, but the workflow itself.
Let’s work from an example. You have a request from engineering for 20 VMs of various sizes that need to be deployed to support a new project. Typically, this comes in as a ticket and is either processed by a human or some level of run book automation to spin up the VMs somewhere in the environment. Sounds good, right?
We have an immediate, but often ignored, problem: the impact this inbound flow has on all the workflows that already exist in the virtual environment. Let me pose a few questions for you to ask yourself in the context of this example:
- Where does this workload land?
- How do you make that determination?
- What is the impact of this new workload in relation to existing workload?
- How do you reconcile it with capacity planning?
- How do you reconcile it with your deployment pipeline?
If you answered these questions with things like “We have reports/Excel spreadsheets and meetings” or “I am not really sure,” you are not alone. If a human intervenes in any of these decisions, we have a flow problem, and possibly the core of something much bigger than a flow problem.
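To make the flow problem concrete, here is a minimal sketch of what taking the human out of the placement decision could look like. The cluster model, the GHz/GB sizing units, and the best-headroom-fit rule are all my illustrative assumptions, not a prescribed algorithm:

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """A simplified view of a cluster's free capacity (illustrative units)."""
    name: str
    cpu_free_ghz: float
    mem_free_gb: float


def place_vm(clusters, cpu_ghz, mem_gb):
    """Pick the cluster with the most relative headroom that fits the request.

    Returns the chosen cluster's name, or None if nothing fits.
    """
    candidates = [c for c in clusters
                  if c.cpu_free_ghz >= cpu_ghz and c.mem_free_gb >= mem_gb]
    if not candidates:
        return None  # a signal to trigger capacity planning, not a ticket
    # Headroom is the tightest of the CPU and memory ratios.
    best = max(candidates, key=lambda c: min(c.cpu_free_ghz / cpu_ghz,
                                             c.mem_free_gb / mem_gb))
    # Reserve the capacity so the next request sees updated headroom.
    best.cpu_free_ghz -= cpu_ghz
    best.mem_free_gb -= mem_gb
    return best.name
```

Even a toy policy like this answers the “where does it land?” and “how do you decide?” questions in milliseconds instead of a meeting, and it keeps its own bookkeeping for the next request in the pipeline.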
One of the first principles of a DevOps culture shift is understanding the process as a whole. Delivering compute services has three main components:
- Managing the runtime environment
- Designing new change
- Deploying those changes
It has been my experience over the years that these components get put into silos to try to maximize each one as a shared service. That was fine before virtualization, and now cloud, became standard in most environments: change was slower and more predictable, and there was plenty of lead time for course corrections along the way.
However, with the explosion of virtualization and self-service portals these silos are becoming more of a hindrance to the flow. When you start to look at the process as a whole you can start to see how these silos are really dams in the process.
Despite the misconception that developers don’t know about infrastructure and operations teams don’t know about development, the truth of the matter is that DevOps is built on a foundation where neither side has to worry about the “other person’s stuff” in a deployment. This is the heart of much of the DevOps movement.
Managing against the storm ….
Looking back at our example, the first question is not where to place these 20 VMs but how healthy and performant the current environment is. If the environment is unhealthy, any decision about where new workload goes will only exacerbate existing problems.
If these VMs are placed in the wrong cluster, you may introduce performance issues that have to be dealt with after deployment. Each small flaw creates management challenges, and those challenges lead to things like multiple small clusters where only certain types of workloads live. There are logical reasons to create clusters, but performance is not always what those designs deliver.
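One way to keep stressed clusters from receiving new workload is a simple admission gate in front of placement. This is a sketch under assumptions of mine: the metric names and the thresholds are illustrative and would come from your own monitoring system and SLOs:

```python
def admits_new_workload(metrics,
                        cpu_ceiling=0.75,
                        ready_ms_ceiling=50.0):
    """Health gate: refuse new placements on clusters already under stress.

    `metrics` is a point-in-time snapshot from monitoring, e.g.
    {"cpu_util": 0.62, "cpu_ready_ms": 12.0}. The 75% CPU ceiling and
    50 ms CPU-ready ceiling are illustrative, not recommended values.
    """
    return (metrics["cpu_util"] < cpu_ceiling
            and metrics["cpu_ready_ms"] < ready_ms_ceiling)
```

Running a check like this before placement turns “how healthy is my environment?” from a quarterly report into a precondition evaluated on every request.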
How do you manage compute at scale?
One of the teams I worked with told me their standard cluster size was twelve servers. When I asked why twelve, they said they had tried twenty, but the time maintenance and management took at that size was too much, so they made the clusters smaller. They suffered from the complexity of compute at scale.
This complexity is bound by three decisions that have to be made:
- Where should my workload live, in the context of all my other workload, at any given point in time, based on demand?
- Is the workload sized appropriately for what the application needs to perform optimally while minimizing my footprint?
- How does the current demand impact my future deployment and capacity decisions?
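The second and third decisions above can also be encoded rather than debated in meetings. Here is a minimal sketch; the helper names, the 20% headroom figure, and the linear-growth forecast are my illustrative assumptions, not a prescribed method:

```python
import math


def rightsize_vcpu(observed_peak_vcpu, headroom=0.2):
    """Size a VM from observed demand plus headroom, not from the request.

    The 20% headroom is illustrative; tune it to your own SLOs.
    Never recommends less than 1 vCPU.
    """
    return max(1, math.ceil(observed_peak_vcpu * (1 + headroom)))


def days_until_full(free_gb, daily_growth_gb):
    """Rough linear forecast of when a capacity pool runs out.

    Returns None when demand is flat or shrinking, since a linear
    model predicts no exhaustion in that case.
    """
    if daily_growth_gb <= 0:
        return None
    return free_gb / daily_growth_gb
```

Trivial as they are, helpers like these make the sizing and capacity questions answerable per workload, per day, instead of once a quarter from a spreadsheet.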
As a side note, these questions, and the drag they put on flow when managing a virtual platform, are a big driver for businesses exploring public cloud offerings like AWS. The perception is that you no longer have to care because something else is figuring it out for you. But while AWS abstracts the underlying infrastructure, it does not abstract these decisions; learn more in this Public Cloud Guide.
To speed up your ability to deliver compute services in a DevOps context, you have to handle a higher flow rate. And to do that, you have to find a way for the workloads to answer those questions for themselves.
You have to start embracing the fact that software can empower virtual environments to self-organize much faster than humans can, responding to sudden increases in workload traffic or decreases in the availability of compute, storage, or networking. A self-organizing virtual environment changes the way you design your data centers.
Clusters fall away to the idea of one giant pool of liquidity, where the boundaries of where workload lives are driven solely by the quality of service the applications need. All because the virtual platform can self-manage. This frees up massive amounts of time to pay down the technical debt sitting in the next two silo areas: design and deploy.
DevOps: What’s Next?
We’ve reached a key point in the discovery process. We understand that flow needs to be increased, and the easiest point of entry is attacking the problem where you can get the most time back to pay down debt elsewhere. In my next article we will look at how to take this new time capital and spend it earlier in the workflow to start accelerating it.