In my latest blog post, “Less Troubleshooting or Less Troubles”, I questioned the goal of modern monitoring tools: to collect as much data as possible at the highest level of granularity. I argued that the monitoring market is so caught up in the race for more data that it has forgotten its true purpose – keeping your virtualized environment performant and efficient.
A week later, as if responding to my blog, VMware published another blog post, “Customizing vROps: Decide Which Metrics to Collect”. It discusses different ways to reduce the amount of data collected by vROps in order to improve the performance of its cluster (!!) and the team’s ability to interpret the metrics collected.
Before we talk about the blog, let’s take a few minutes to think about what it means to keep your virtualized environment performant and efficient.
Performance is an easy goal to define (but very hard to achieve). Our business is driven by the way our applications perform, so a performant environment is an environment where users get the service they need from the applications quickly and consistently.
Efficiency is a harder goal to define, but we will try anyway. Virtualized environments consume resources in order to perform. These resources can be physical or virtual – servers, storage, power, cooling, compute, memory and more – and they also include the human resources required to plan, build and run the environment, and in some cases, monitor it. Keeping your environment efficient means using only the resources we need, when we need them. It also means knowing how to balance the tradeoffs between the different resources and managing the environment in a way that neither over- nor under-utilizes them.
Going back to VMware’s blog post: the author provides three different ways to “dial in” the amount of data collected by vROps, with the purpose of reducing the CPU, memory and disk resources needed to handle all the data that is collected. The author further explains that this fine-tuning of collection helps “to gain greater capacity and performance within your vRealize Operations cluster” and, in theory, provides you with better alerts (i.e. less noise).
While I am happy to know that monitoring tools acknowledge that this race for more data comes with a real cost (in performance and budget) and even offer ways for customers to opt-out of it, there are still several significant issues raised by this blog that merit further discussion.
Monitoring tools can significantly impact performance and cost.
VMware is saying it, not me; that’s why they offer ways to “dial in” the scope of collection and its resulting impact on performance and cost. Nevertheless, the default configuration of monitoring tools is to collect all the data, and the means to tune the scope of collection are complicated, requiring the system admin to master various user interfaces, methodologies and skill sets. So you start with a configuration that places a heavy strain on your CPU, memory and storage resources and requires you to increase your spending accordingly (do you even know the cost of the CPU, memory and storage consumed by your monitoring solution?). Even worse, the only way to reduce this impact is to invest countless man-hours in an endless fine-tuning process.
What is the right amount of data?
This is where things get even more confusing! One blog post says that the more data you collect, the better your alerts are (closer to real time, with fewer false positives). Another says that the more data you collect, the more congested your environment gets. There is clearly a tradeoff between the scope of data collection and its impact on the environment, and VMware prides itself on giving you the tools to decide the right blend.
Unfortunately, managing this tradeoff is a challenge – even with a plethora of tools. With no real guidelines, advice or metrics from VMware for managing it, IT teams are left with yet another unsolvable problem (on top of the real one). There are many aspects and tradeoffs in the data center that can be determined only by IT (and no company in the world can or should determine them on its behalf), but properly configuring a monitoring tool is definitely not one of them.
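To make the tradeoff concrete, here is a back-of-the-envelope sketch of how quickly collected data grows with scope and granularity. All the figures are illustrative assumptions (object counts, metrics per object, bytes per stored point) and not vROps internals:

```python
# Rough sketch of the collection tradeoff: how the number of data points
# (and an approximate storage figure) grows with scope and granularity.
# Every constant here is an illustrative assumption, not a vROps figure.

SECONDS_PER_DAY = 86_400
BYTES_PER_POINT = 8  # assumed raw size of one stored metric value


def daily_footprint(num_objects: int, metrics_per_object: int,
                    interval_seconds: int) -> tuple[int, float]:
    """Return (data points per day, approximate MiB per day)."""
    points = num_objects * metrics_per_object * (SECONDS_PER_DAY // interval_seconds)
    mib = points * BYTES_PER_POINT / (1024 ** 2)
    return points, mib


# Hypothetical environment: 1,000 VMs with 200 metrics each,
# sampled every 5 minutes versus every 20 seconds.
coarse_points, coarse_mib = daily_footprint(1_000, 200, 300)
fine_points, fine_mib = daily_footprint(1_000, 200, 20)

print(f"5-minute interval: {coarse_points:,} points/day (~{coarse_mib:.0f} MiB)")
print(f"20-second interval: {fine_points:,} points/day (~{fine_mib:.0f} MiB)")
```

Under these assumptions, moving from a 5-minute to a 20-second interval multiplies the daily data volume by 15 for the exact same environment – and that is before indexing, rollups and replication overhead, which is why the “right blend” question has no obvious answer.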
Are you Monitoring your Monitoring tool?
What the two VMware blog posts reveal most clearly is the difficulty of managing an IT monitoring solution. In theory, your monitoring solution should seamlessly help you keep your environment in a performant and efficient state, as I’ve argued before. In reality, monitoring tools have become so complicated that they need a monitoring tool of their own, and you need experts (or paid professional services) to find that sweet spot between collecting enough data and not impacting the environment. If you own a monitoring tool, ask yourself: how much time do you spend monitoring and managing your monitoring tool? And what other important tasks get put off as a result?
If you have reached this point of the blog, then you have already spent more time on your monitoring tool than you should :-). Maybe it is time to get out of the break-fix loop that keeps these monitoring tools relevant and step into the era of autonomic performance assurance – your environment is ready.