
Chris Williamson

Datacenter Management and RCA: How to Ditch the Detective Work

When it comes to datacenter management, IT operators, admins, and the like often feel like detectives. It’s a glamorous gig in Hollywood: you get to chase down bad guys, save victims, and always get the girl (or guy). As we know, the reality of crime fighting hardly resembles the silver screen. So why do we accept being treated as detectives in the datacenter?

For the past decade or more, we have accepted that sleuthing for the root cause of a problem is the fundamental operation on which IT and datacenter “management” is built. I grant that rose-colored glasses make this job look a bit artistic and nuanced, but think of it from a different perspective: your customer, the person whose performance relies on your ability to solve a problem. And as we know, it’s not one problem or one alert at a time. It’s dozens or hundreds. Sherlock had it easy.

[Image: Sherlock Holmes in the Datacenter]

Regardless of scale, though, if you had the option, wouldn’t you choose to prevent the crime rather than suffer after the fact? Fortunately, we have established these types of preventative constructs in our legal system. Yet we still try to chase alerts, troubleshoot issues, and put out fires in datacenter management. Most of us tolerate or, worse yet, embrace such a reactive system. However, ask anyone in this industry how important assuring application performance is and I guarantee you’ll get a scoff, a chuckle, or silence. We know it’s important; it’s probably the single objective that most often keeps us up at night (ask your significant other how much he or she enjoys the glow of your laptop display at 3am).

I raise this because most of us don’t understand the disconnect between our goals and how we set about trying to achieve them. We have established that our goal as an industry is to deliver service to our customers and business. But we try to accomplish this by waiting for something to break, for an alert, or for someone to call to complain. What does this mean? It means something is not working. It means that we aren’t achieving our goal of service assurance. It is only now that we choose to break out our detective hats and take some type of action to resolve a problem.

The alternative is, of course, prevention. In an ideal world, we would have a person or datacenter management team constantly examining and fine-tuning an infrastructure, ensuring that each and every workload has access to the necessary resources across storage, network, and compute simultaneously.

So why don’t we do that? Unfortunately that requires two things that are very expensive and that cannot be provisioned on-demand: time and humans. The sheer complexity of virtual infrastructures and the management thereof requires so much brainpower to understand and maintain that we humans simply cannot scale accordingly. This is why we must resort to the aforementioned break-fix management strategy. Or do we?

The solution to the problem is not humans but software. But before we change the solution, we have to change how we think about the problem. After all, the question is not “How do I troubleshoot faster?” The question is “How do I prevent the issues that require troubleshooting from occurring in the first place?” It is only then that we can reasonably implement a solution that enables us to achieve this novel goal.

So what does it take to do this? Instead of defining a threshold (a place we don’t want to be), let’s leverage software to process hundreds of thousands of data points across an environment and execute the right decisions in real time. This approach to datacenter management allows us to maintain, for the first time, a perpetually healthy infrastructure. Only now are we truly satisfying our initial goal: assuring application performance.
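To make the contrast concrete, here is a minimal sketch of the two mindsets. It is an illustration only, not any product’s algorithm: the `Host` class, the single CPU dimension, the 90% alert threshold, the 75% target, and the greedy placement rule are all assumptions made for the example. A real platform would weigh storage, network, and compute together.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host model: CPU capacity and per-VM CPU demand, in GHz."""
    name: str
    cpu_capacity: float
    vms: dict = field(default_factory=dict)  # VM name -> CPU demand

    def utilization(self) -> float:
        return sum(self.vms.values()) / self.cpu_capacity


def threshold_alerts(hosts, threshold=0.9):
    """Reactive approach: report only hosts that are already past the threshold."""
    return [h.name for h in hosts if h.utilization() > threshold]


def rebalance(hosts, target=0.75):
    """Proactive approach (greedy sketch): while a host is above the target
    state, move its smallest VMs to the least-loaded host that can accept
    them without itself exceeding the target."""
    moves = []
    for src in hosts:
        # sorted() returns a new list, so mutating src.vms in the loop is safe.
        for vm, demand in sorted(src.vms.items(), key=lambda kv: kv[1]):
            if src.utilization() <= target:
                break
            candidates = [
                h for h in hosts
                if h is not src
                and (sum(h.vms.values()) + demand) / h.cpu_capacity <= target
            ]
            if not candidates:
                continue  # nowhere to place this VM; try the next one
            dst = min(candidates, key=Host.utilization)
            dst.vms[vm] = src.vms.pop(vm)
            moves.append((vm, src.name, dst.name))
    return moves


hosts = [
    Host("A", 10.0, {"a1": 4.0, "a2": 3.0, "a3": 1.5}),  # 85% utilized
    Host("B", 10.0, {"b1": 2.0}),                        # 20% utilized
]

alerts_before = threshold_alerts(hosts)  # nothing has "broken" yet
moves = rebalance(hosts)                 # act before anyone has to call
```

In this toy environment the threshold check stays silent because host A sits at 85%, just under the alert line, while the target-state loop has already moved one small VM to host B. The point is the difference in question being asked: “has something crossed the line?” versus “what action keeps every host in a healthy state?”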

Only once we are able to deliver service to the existing workloads can we begin to talk about efficiency, another widespread objective in IT (after all, most of us got into virtualization for this reason). Put another way: if you can’t assure service in your existing environment, how do you expect to add workloads, thereby increasing complexity and in turn reducing the ability of humans to manage the datacenter?

It’s only after we are able to provide reliable application performance that we can begin to talk about density improvements and increasing efficiency. And it is only with real-time, software-based control that we, as an industry, can expect to deliver that reliable performance and subsequently drive density safely.

Simply packing more VMs onto your infrastructure does not add value. If those workloads, now very densely packed, cannot perform, then what good is our allegedly more efficient infrastructure? The density improvements are moot. After all, our goal is to assure performance. However, if we leverage software to define the target state of our infrastructure, we can safely drive that target higher, thereby increasing VM density. Remember, this can only be done and sustained if we are making the correct resource management decisions the first time, every time, in real time.

Without this real-time control—constantly matching workload demand with infrastructure supply—we are still left chasing alerts, firefighting, and fielding calls from unhappy customers. As glamorous as the Sherlock Holmes fantasy may be, it would take a lot to convince me that the reality of IT operations is remotely reminiscent of Sir Arthur Conan Doyle’s tales.

So while some may celebrate or glorify such menial tasks, I would much rather leave my laptop home on the weekends. I would much rather walk into my office in the morning to a dashboard that looks like these and a phone that remains eerily still while I work. Fortunately this reality is achievable once we a) begin to think differently about the problem we are truly trying to solve, and then b) implement a platform that executes that solution. You tell me; which of the environments below would you like to see every day when you arrive and when you leave?

[Image: datacenter management]

[Image: Holmes and Watson]

No $%^&, Sherlock.