
Turbonomic Blog

RCA Revisited: The Valid Reasons for Root Cause Analysis

Posted by Eric Wright on Jul 7, 2016 3:42:48 PM

In a previous post, I wrote about the shift away from Root Cause Analysis (RCA) due to the increase in microservices implementations.  It's much more than just microservices deployments that will change the way we operate our application infrastructure.  In that article, I actually referred to a distributed application design using AWS infrastructure.

The reason that we are revisiting this article's concept is that it triggered some very interesting activity on Twitter and through conversations with many folks in the IT community.  It felt like a little clarification was in order.

RCA Matters, but in a Different Way

Abstractions create very interesting ways by which we can care less about what is underneath them.  They become a logical boundary to something below or above.  We are able to consume those resources using APIs without having to worry about how content is created/stored/managed behind that abstraction.  This brings up some interesting challenges at the same time as it creates simplicity.

If something fails within a database platform that is hosted in a Database-as-a-Service (DBaaS) environment, what is our recourse? We still have to do some RCA in the event of a loss of service:

  • Is the database online?
  • Is the database access available?
  • Is the data itself corrupted/missing?
  • Is it a security issue?
  • If no to all of these...guess what?
  • Have you tried rebooting it?
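
As a rough sketch, the checklist above can be thought of as an ordered series of probes, where the first one that fails tells you where to start digging. The names Check and first_failing_check are hypothetical, and the lambda probes are placeholders standing in for whatever real tests your platform allows.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    question: str                # the question from the checklist
    healthy: Callable[[], bool]  # hypothetical probe; True means no problem found


def first_failing_check(checks: List[Check]) -> Optional[str]:
    """Run the probes in order and return the first question whose probe fails.

    Returns None when every probe passes, which is the
    "have you tried rebooting it?" case from the list.
    """
    for check in checks:
        if not check.healthy():
            return check.question
    return None


# Placeholder probes: replace each lambda with a real test for your platform.
checks = [
    Check("Is the database online?", lambda: True),
    Check("Is the database access available?", lambda: True),
    Check("Is the data itself corrupted/missing?", lambda: True),
    Check("Is it a security issue?", lambda: True),
]

suspect = first_failing_check(checks)
print(suspect or "Nothing obvious found: consider a reboot")
```

Keeping the probes in a list preserves the order of the questions, so the cheapest and most likely checks can run first.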

Why do we reboot things?

Rebooting as the last (or sometimes first) recourse in an outage is done for a reason.  We have runaway processes, memory issues, CPU queues unable to clear, locked filesystems, and a whole bundle of other potential situations that are resolved when the system is restarted.

In our DBaaS platform it may seem a little odd, but you can actually reboot it too.  Amazon RDS, for example, has an option to reboot the DB instance.  In the case of RDS, the reboot restarts the database engine, and for a Multi-AZ deployment it can optionally fail over to the standby so that access is regained in a protected way.
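
For illustration, here is a minimal sketch of triggering that RDS reboot from Python. It assumes a boto3 RDS client is passed in; the function name reboot_rds_instance and the instance identifier used in the usage note are hypothetical, while reboot_db_instance, DBInstanceIdentifier, and ForceFailover come from the boto3 RDS client.

```python
from typing import Any, Dict


def reboot_rds_instance(rds_client: Any, instance_id: str,
                        force_failover: bool = False) -> str:
    """Ask RDS to reboot one DB instance and return the status it reports.

    rds_client is assumed to be a boto3 RDS client, e.g. boto3.client("rds").
    force_failover only applies to Multi-AZ deployments.
    """
    response: Dict[str, Any] = rds_client.reboot_db_instance(
        DBInstanceIdentifier=instance_id,
        ForceFailover=force_failover,
    )
    return response["DBInstance"]["DBInstanceStatus"]
```

With credentials configured, a call might look like reboot_rds_instance(boto3.client("rds"), "example-db"). Passing the client in, instead of creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real instance.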

Let's move back further down the stack to see the reason that we call it "root" cause.

Hardware Fails. Be Ready to Know Why

Roots are at the bottom of a tree.  Root causes are often at the bottom of the stack.  Roots typically highlight an issue that goes all the way up the tree...or in this case, the application environment.  So, what are some interesting hardware-related issues that can occur?

  • Fan failures - Computing processes create heat.  Real, physical heat.  Internal systems are cooled by fans and by ambient air conditioning units inside server rooms.  If these fail, overheating can occur, which can create odd behaviours within CPU, memory, and storage platforms.
  • Storage Issues - There are too many to list because storage is complicated and has many interdependencies within each hardware storage environment.
  • Is it plugged in? - Cabling of any kind, whether it is networking, power, or internal cabling within the servers and blades, can fail.  While it doesn't happen often, it does happen.  Sometimes swapping out the cable or even just making sure it is seated properly can be the fix.
  • Thanks for the memories - Memory fails in some rather interesting ways.  There is a reason that some memory is shielded, and some doesn't have to be.  That stretches to flash cards as well.  They may not have moving parts, but they aren't flawless either.

Keeping the Roots as we Climb the Tree

The message that I wanted to highlight here is that RCA will continue to be a part of problem determination in response to application and system outages.  What we do see happening is that the industry is making a strong shift towards providing more distributed resiliency to allow applications to survive underlying partial outages better.  As strong as the shift is, it also moves at a glacial pace like all of the fundamental changes that have occurred in our industry.

Don't put away the Allen key for your servers just yet, because you may still have to pop open a few to solve some classic outages for a long time to come.

And yes, I know that most servers don't need Allen keys any more either.  I'm sure I'll get a few comments on that :)

Topics: Servers and Hardware
