I hate alerts. Always have, always will. I hate the noise (not audible noise, but the kind that drowns out the signal). I hate the wasted time and the way alerts pull focus away from what matters.
But let me step back a bit...
I’m from Cincinnati, home of the Bengals, the Reds and my home airport. If you fly to Cincinnati, however, you do not actually land in Ohio. You land just to the south, in the state of Kentucky.
By now I’m sure you’re wondering - what does this have to do with my hatred of alerts? Let me explain.
In a modern jet there are many different alerting systems. There is one in particular I want to focus on when talking about virtualization and other infrastructure alerts: the Ground Proximity Warning System (GPWS). This system is responsible for alerting the pilot to various conditions where the plane and the ground are in danger of coming in contact in a way that is not good. One of these conditions is known as an Excessive Terrain Closure Rate condition. It occurs when the plane detects that the ground is rising toward it at a rate that suggests it could be heading into the side of a mountain, and that the pilot may want to climb to avoid the ground.
This alert works as intended most of the time, unless you are on a southbound final approach to land at my home airport. That approach brings you in over the Ohio River. On the Kentucky side the ground rises rapidly to a plateau right before the plane reaches the runway. Every time a plane comes in for a landing, the GPWS alert goes off. This is well known among pilots, so this very important alert is ignored as a false alarm every time they land.
Just like pilots with these cockpit false alarms, operations teams are bombarded with hundreds of alerts and red lights that they’re forced to react to. This is why I hate alerts so much. There are many reasons to hate them, but they can be grouped into three main ones.
Reason 1 - Most Alerts Mean Nothing Until They Get You Fired
When I was in operations I had a folder where email alerts were filed if they were deemed noise. I labeled this folder /dev/null, a play on the Unix device file where you redirect output you never want to see again.
This was great most of the time, until one small alert that I felt was nothing turned out to be a huge issue. It was a full disk alert that I had seen 100 times before: a log file disk reporting 99% full. I saw these all the time, and 99.9% of the time my team ignored them. There was a cron job that normally ran and moved logs off the machines into a repository for long-term storage. The alert was normally just a trigger for the clean-up job to execute.
The only issue was that the clean-up process used an older service account. This account was on a list of IDs targeted to be disabled because the LDAP team was cleaning up our standard service accounts. Once the ID was disabled, the job could no longer push the logs off the system and into the logging server. The logging partition filled up and the application could no longer write details to the log. The application started to throw exceptions, which could not be written either because the disk was full. This led to some intense war room sessions as we tried to track down the problem.
Since most alerts are typically treated with such disregard, small problems can explode into big problems really fast. The only way to really remedy this problem is to cut back on the number of alerts -- not add more. Focus on driving your virtual platform to the best performing configuration that’s also its most efficient.
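To make the disk-full story concrete, here is a minimal sketch of a check that stays quiet while the clean-up job does its work and only asks for a human when the clean-up itself fails or does not free space. Everything in it is hypothetical: the names evaluate_log_disk and CleanupResult, the 95% threshold, and the two callables standing in for "read disk usage" and "run the clean-up job" are assumptions for illustration, not something from the environment described above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CleanupResult:
    """Outcome of the (hypothetical) log clean-up job."""
    succeeded: bool
    error: str = ""


def evaluate_log_disk(
    read_usage_pct: Callable[[], float],
    run_cleanup: Callable[[], CleanupResult],
    threshold_pct: float = 95.0,  # assumed threshold, tune per system
) -> Optional[str]:
    """Return a message only when a human actually needs to act."""
    if read_usage_pct() < threshold_pct:
        return None  # disk is healthy, nothing to say

    # A full disk is the normal trigger for clean-up, not a page by itself.
    result = run_cleanup()
    if not result.succeeded:
        # This is the case from the story: the job could not run.
        return f"log cleanup failed: {result.error}"

    usage_after = read_usage_pct()
    if usage_after >= threshold_pct:
        return f"log disk still {usage_after:.0f}% full after cleanup"

    return None  # clean-up worked, stay silent
```

With a check shaped like this, the hundred routine "99% full" emails would never have been sent, and the one message that did arrive would have named the failed clean-up job rather than its symptom.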
Reason 2 - Most Alerts Are Based On Something That Probably Does Not Matter
We all know about those particular alerts that are set up in the systems we monitor. Once, a long time ago, this one application had this one really weird problem. Any time x happened it would cause the whole system to go down. So an alert was created to catch the symptoms of the problem so someone could restart the service or the machine. Then, after several patches, the problem was corrected, yet the alerts remained.
A lot of monitoring solutions have pre-canned alerts based on best practices and static thresholds. They try to cover 70% of common issues or known catastrophic conditions. I can remember turning on BMC Patrol agents and watching the whole console turn red until I disabled all of the base alerting modules for that new agent group. Email boxes are full of these alerts, which leads back to my first reason for hating them.
To get around this, operations again has to focus on getting the platform to the best known state and then keeping it there. That way the alerts that are generated are about real problems that need human intervention. It’s not about the 80% you don’t care about. It is about the 20% that require you to do root cause analysis.
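As a small illustration of why pre-canned static thresholds turn consoles red, the sketch below compares a fixed threshold with a check against each host's own recent history. This is one common alternative, not a method this article prescribes, and the function names, the 80% threshold and the three-sigma rule are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def static_alert(value: float, threshold: float = 80.0) -> bool:
    """Pre-canned style: fire whenever the value crosses a fixed line."""
    return value > threshold


def baseline_alert(value: float, history: Sequence[float], sigmas: float = 3.0) -> bool:
    """Fire only when the value is unusual for this particular host."""
    if len(history) < 2:
        return False  # not enough data to know what normal looks like
    normal = mean(history)
    # Floor the spread at 1.0 so a perfectly flat history does not
    # make every tiny change look like an anomaly.
    spread = max(stdev(history), 1.0)
    return value > normal + sigmas * spread


# A batch host that always runs hot, and a normally idle host.
busy_history = [88.0, 90.0, 91.0, 89.0, 92.0]
quiet_history = [10.0, 12.0, 11.0, 9.0, 13.0]
```

For the busy host, a reading of 90 trips the static check even though it is business as usual, while the baseline check stays quiet. For the idle host, a jump to 45 slips under the static line but stands out against its own history.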
This leads me to my third reason to hate alerts...
Reason 3 - Most Alerts Distract You From Real Problems
We have all done it. Chased our tails trying to find a problem because the alert message was meaningless and generic. I have countless stories of generic alerts that could have meant anything and sent me on a wild goose chase trying to find the answer. The fact is that most alerts are symptoms of the problem, not the root cause. A lot of investigation has to take place to find the real cause of a problem, and many of the trails followed along the way lead to dead ends.
These challenges are only compounded when you add in virtualization, agile development models and DevOps. The speed at which agile and DevOps teams want to move exacerbates them. There is more noise hiding the real alerts, making it hard to find the real problems.
So how can you deal with these alerts in a new way that suppresses the noise and amplifies the real problems? You could do an alert audit, going through each alert and assessing its validity. While that would cut through some of the noise, it would take a very long time to do. And what do you do about your changing environment? As soon as it changes, all the alerts would have to be assessed again to see if they were still valid and if there were gaps. You have traded one alert management process for another.
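For a sense of what such an audit involves, here is a minimal sketch that ranks alerts by how often they fired and how rarely anyone acted on them. The input format (a list of alert-name and acted-on pairs), the function name audit_alerts and the cut-off values are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def audit_alerts(
    history: Iterable[Tuple[str, bool]],
    min_fired: int = 10,            # assumed: ignore alerts that rarely fire
    max_action_rate: float = 0.05,  # assumed: "noise" means acted on <= 5%
) -> List[Tuple[str, int, float]]:
    """Return (alert name, times fired, action rate) for likely noise,
    noisiest first."""
    fired: Counter = Counter()
    acted: Counter = Counter()
    for name, was_acted_on in history:
        fired[name] += 1
        if was_acted_on:
            acted[name] += 1

    candidates = []
    for name, count in fired.items():
        rate = acted[name] / count
        if count >= min_fired and rate <= max_action_rate:
            candidates.append((name, count, rate))

    # Sort so the alerts that fire the most come first.
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```

Note what this would have said about the disk alert from Reason 1: fired a hundred times, acted on almost never, so it scores as pure noise. The output is a list of candidates for review, not for deletion, and it is only a snapshot that goes stale as soon as the environment changes.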
The trick is to start eliminating alerts automatically. One of the main sources of alerts is the platform. If you could somehow manage the platform in such an automated fashion that the platform becomes a flexible, resilient delivery channel then most alerts will start to fade away and only the real problems will be alerted on.
This level of automation would require a new way of looking at managing the different workloads in your virtualized environments. The current rules-based engines fall down because they are based on past experience, and any change can invalidate some if not all of the rules driving them. This leaves the system with incomplete alerts, or alerts so generic they go off for anything that might be associated with a root cause problem. They simply cannot scale to the speed and frequency of change that the business requires.
Just like with the proximity alert in the plane, I am not saying get rid of alerts or that you don’t need them. What I am saying is that there needs to be a new way of looking at how to eliminate the false positives. I believe this can only be done by adding intelligence to workload management.
Want to start cutting down on the number of alerts you hate today? Start by thinking afresh about alerts in the first place, and understand that there are solutions out there that can lift the signal out of the noise. Then you can get back to the important task you were hired for in the first place - and that wasn’t managing alerts.