In the last four decades, initiated by the “Big Four”, IT management has been in a race to nowhere: a race to discover more, collect more and present more. More reports, more graphs, more views, more alerts. If there is an IT asset out there, there is a tool to discover it. If there is a metric that can be collected, there is a tool to monitor it, graph it and alert on exceptions. How many management tools do you have in your environment? Which one do you use when? How many different reports/views/alerts are you looking at each day? How much time do you spend investigating them? Are you in control? Do you sleep better?
IT environments are complex, and managing them is challenging. Many moving parts with very complex interactions make it difficult to effectively track, monitor and control these environments. For years, in trying to address a broad range of pain points, we have been throwing more and more management tools at the environment. Every pain point has been addressed with a different tool, ultimately failing to address the pain. Introducing a new technology or product to the environment led to the deployment of yet another management tool. Very quickly we ended up using hundreds of different management tools and products. Instead of addressing the management challenges, we increased the TCO and created an operational and administrative nightmare. In the last decade, I have encountered many enterprises that realized the mess and embarked on multi-million dollar projects aimed at reducing the number of management tools in use from a few hundred to a handful.
The IT landscape is transforming. Virtualization, cloud and now SDDC lead to increasingly dynamic, complex, heterogeneous and large IT environments. We are no longer looking at static, siloed IT environments where the boundaries between technology layers and products are well defined and understood. Virtualization brings these walls down. It’s no longer one application on one known, dedicated physical machine with attached storage, but rather complex, dynamic, boundary-less compute, storage and network available on demand. Can we scale to these types of environments with the existing, brittle tools and approaches?
The SDDC management problem can’t be solved by acquiring a collection of point, niche tools, throwing them into a basket, putting a “nice” user interface on top of them, labeling the basket with a fancy name and throwing it at the market as the answer. This doesn’t solve anything.
The SDDC management problem can’t be solved by breaking it into three separate, non-integrated areas of topology scope, data collection and management functions, as suggested by some. Topology scope and data collection are important, but they need to be done in context and with purpose.
The SDDC management problem can’t be solved by a topology scope that “has more complete coverage” of Physical Compute, Virtual Compute, Physical Network Devices, Virtual Networks, Physical Storage Devices, Virtual Storage, OS, Applications and Public Workloads running in AWS, Azure, etc., but lacks a common data model to semantically represent all the interdependencies among this broad range of entities.
The SDDC management problem can’t be solved by bottom-up data collection done by separate, niche point monitoring tools, each collecting many different types of “structured” and “unstructured” data points, such as performance metrics, logs, alerts and reports, about different types of entities in the data center. These collections of point tools, usually developed by different companies, lack a common abstraction and common semantics, and as such, all they can produce is information (don’t we have enough?) for human consumption.
The SDDC management problem can’t be solved with a collection of Management Functions delivered by a collection of point tools, each function by a different tool. It can’t be solved by a collection of automation tools, basic performance monitoring tools, anomaly detection tools, capacity management tools, change management tools, log management tools, compliance tools and troubleshooting tools. These tools, like the data collection tools, were developed by different companies and lack a common abstraction and common semantics; as the word “tool” suggests, they are a collection of tools to be used by humans.
The SDDC management problem can’t be solved with a collection of “tools” that produce information for humans to consume and for humans to use. How many tools does a man need to have before you call him a man? The answer, my friend, is “Less is More!”
The SDDC management problem can and must be solved by software in software.
The Turbonomic Autonomic Platform (TAP)
To solve the SDDC management problem we must start with an autonomic platform comprising three core tenets:
- Abstraction – a data model of the SDDC environment that abstracts away the limitless details and provides a common, semantically rich representation for introspecting and controlling the environment.
- Analysis – an intelligent analysis engine driven by the knowledge captured by the abstraction that makes continuous, real-time decisions to control the SDDC environment in a desired state.
- Automation – an orchestrated set of actions driven by the analysis engine to control any workload, on any infrastructure, anywhere, all the time: controlling running workloads, deploying new workloads and planning for any future changes and trends.
Turbonomic is the ONLY vendor to deliver an autonomic platform to control the SDDC, the ONLY platform where self-managed applications autonomically assure their performance through software and by software, with minimal human intervention only when absolutely needed.
The Turbonomic Autonomic Platform (TAP) abstracts the SDDC as a Market of Buyers and Sellers: a Market of Service Entities that trade Commodities they consume from each other. Applications, Containers, VMs, Hosts, Zones in a public cloud, Storage, Networks, Disk Arrays, Switches, etc., are all Service Entities. Compute resources/metrics (such as memory, CPU, IO, network, swapping, ready queue, ballooning), storage resources/metrics (such as IOPS, latency, storage amounts, thin and thick provisioning), network resources/metrics (such as flow, buffers) and application metrics (such as TPS, application response time) are all Commodities traded by the Service Entities. Constraints are also Commodities. Network and storage configurations, as well as business constraints such as compliance, licensing, etc., are all Commodities traded by the Service Entities. Note that every Service Entity is both a Buyer and a Seller.
The TAP analysis engine, the Economic Engine (EE), uses the Market to control the SDDC in a Desired State: an equilibrium state where demand is satisfied by supply, a state in which application performance is assured while the SDDC is utilized as efficiently as possible. As in any market, prices are used to keep the SDDC in equilibrium. Sellers price the Commodities they provide/sell as a function of the Commodities’ utilization, while Buyers shop for the Commodities they consume/buy. As a Buyer, a Service Entity shops around and decides where to consume the resources it needs; as a Seller, it continuously compares its revenue and expenses. When it makes money, it adds/provisions more of the inventory/resources it sells; when it loses money, it removes/suspends some of that inventory. Each entity makes its own decisions, and the SDDC as a whole is controlled in an equilibrium where application performance is assured while maximizing the ROI of the available capacity, on and off prem.
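The market mechanics described above can be sketched in a few lines. This is a minimal, illustrative toy model only: the function names, the convex pricing curve and the data shapes are my own assumptions, not Turbonomic’s actual implementation.

```python
# Toy sketch of the Buyer/Seller market abstraction. All names and
# formulas here are illustrative assumptions, not the real EE.

def price(utilization):
    """Price a commodity as a function of its utilization.

    A convex curve: cheap while the seller is idle, rising steeply
    toward saturation, which steers buyers away from congestion.
    """
    assert 0 <= utilization < 1
    return 1.0 / (1.0 - utilization) ** 2

def cheapest_seller(sellers, demand):
    """A buyer 'shops': pick the seller whose quoted cost is lowest,
    based on what its utilization would be after taking the demand."""
    def quote(seller):
        new_util = (seller["used"] + demand) / seller["capacity"]
        return price(new_util) * demand if new_util < 1 else float("inf")
    return min(sellers, key=quote)

hosts = [
    {"name": "host-a", "capacity": 100.0, "used": 90.0},  # congested
    {"name": "host-b", "capacity": 100.0, "used": 30.0},  # mostly idle
]
best = cheapest_seller(hosts, demand=10.0)
print(best["name"])  # the idle host wins despite identical capacity
```

The convex curve is the key design choice: near saturation, prices explode, so demand naturally drains away from congested sellers, which is one simple way to drive the system toward the equilibrium described above.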
TAP automation uses the EE to deliver a unified, integrated, autonomic platform that controls workloads running in the SDDC, deploys new workloads, and plans for any future changes, projections and reservations. TAP orchestrates the entire workload life cycle, from conception to decommission.
TAP mediates with a broad range of platforms and systems across the entire IT stack and multi-cloud data centers, and maps them to the Market abstraction. TAP mediates with:
- Physical Compute (such as HP, Dell)
- Virtual Compute (such as ESX, Hyper-V, KVM)
- Physical Network Devices (such as Cisco, Arista, Juniper)
- Virtual Networks (such as ACI, NSX)
- Physical Storage Devices (such as EMC VMAX, EMC VNX, NetApp, Pure, ExtremeIO, ScaleIO, HP 3Par)
- Virtual Storage (such as VSAN)
- Converged Platforms (such as UCS, Nutanix, HP)
- IaaS Platforms (such as OpenStack, vCAC, CloudForms, System Center)
- Containers (such as Docker)
- Containers as a Service (CaaS) Platforms (such as K8s, Mesos)
- Platform as a Service (PaaS) Platforms (such as OpenShift, Cloud Foundry)
- Applications (such as VDI, WebSphere, WebLogic, JBOSS, MySQL, Exchange)
- Workloads in the Public Clouds (such as AWS, Azure, SoftLayer)
By mediating with all of the above, TAP is the ONLY platform orchestrating resource allocation across the entire SDDC and, as such, it is the ONLY platform that assures application performance.
TAP integrates (via an open REST API) with a variety of Service, Infrastructure, Cloud and Access Management systems to support an integrated full workload life cycle.
TAP is the ONLY platform to deliver integrated Topology Scope, Data Collection and Management Functions. TAP is the ONLY platform that solves the SDDC management problem.
To Troubleshoot or NOT to Troubleshoot?
In the “old world”, the pre-virtualization world, we had no choice. We had to troubleshoot, pinpoint the root cause and fix it to assure application QoS. Troubleshooting is a complex, lengthy process that requires deep domain expertise, but we had no choice: service couldn’t be restored and performance couldn’t be assured unless the root cause was identified and resolved.
It used to be “easy”. The world was static, with relatively few moving parts: a single application on a single OS on a single server with attached storage. With a few (sometimes more than a few) point tools we were able to get our hands around our environments and manage them. Well, SDDC changes everything. No more static boundaries and well-defined interactions between the IT silos.
Ask yourself: Do I know where my applications are? Do I know where my virtual machines are? Do I know what resources they are using? Do I know how they perform? Do they need more or less resources to deliver on their goals? Are there bottlenecks in my environment? Where are the bottlenecks?
And more importantly: Do I know what I need to do now? In the next minute? Hour? Day? Week? Month? Do I need to start a new VM? Stop a VM? Move a VM? Do I know where to start/move the VM? Do I need to reconfigure any of its resources? Do I need to provide more resources? What do I need to do to address the bottlenecks? How do I prevent them?
The good news is that, along with the increased complexity, SDDC also provides thousands of knobs enabling much better and more flexible control of the environment. However, instead of taking advantage of this and “changing the game” of how we manage and control the SDDC, the traditional management tools are lagging far behind and continuing on the “trajectory to nowhere” of the past: collecting more and more data, providing less and less value, alerting when something is wrong in the environment, but leaving you the heavy lifting of troubleshooting, identifying the root cause and, most importantly, fixing it!
A much more effective approach is to take advantage of the thousands of control knobs SDDC platforms expose and control the SDDC in a Desired State in which application performance is assured. In this approach, the software continuously collects thousands of performance metrics and the available capacity across the entire IT stack and the data centers, considers all business and physical constraints, analyzes all these inputs in real time and controls the SDDC in a healthy state. Instead of alerting you when problems occur, or are about to occur, the platform prevents them from happening in the first place. This approach enables you to maintain your environment in a much more stable and predictable state, avoiding problems, optimizing performance, maximizing infrastructure efficiency and reducing operational costs!
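The collect–analyze–act cycle described above is, at its core, a closed control loop. The sketch below is a deliberately tiny stand-in, with hypothetical functions and thresholds of my own choosing; the point it illustrates is that actions, not alerts, close the loop.

```python
# Minimal control-loop sketch: observe metrics, decide actions,
# execute them. All names and numbers are illustrative assumptions.

def control_loop(observe, decide, act, steps=3):
    """Continuously collect state, decide on actions, and apply them."""
    for _ in range(steps):
        state = observe()          # metrics, capacity, constraints
        actions = decide(state)    # real-time analysis
        for action in actions:
            act(action)            # prevent problems instead of alerting

# Toy wiring: a single over-utilized host triggers a "move" action.
utilization = {"host-a": 0.95, "host-b": 0.30}

def observe():
    return dict(utilization)

def decide(state):
    return [("move_vm_from", h) for h, u in state.items() if u > 0.9]

def act(action):
    verb, host = action
    utilization[host] -= 0.4       # moving a VM off relieves congestion

control_loop(observe, decide, act)
print(utilization["host-a"])       # back below the 0.9 threshold
```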
Broadly speaking, failures fall into two types: “hard” and “soft”. “Hard” failures, where something is physically broken, will always require human intervention, but they are relatively easy to troubleshoot. “Soft” failures are performance degradations that are very hard to troubleshoot; however, they can be prevented in software without human intervention. The majority of the failures we chase are “soft”, so why not prevent them? Furthermore, even for “hard” failures, our first goal should not be how to troubleshoot them, but rather how to continuously deliver the QoS we need to deliver. Troubleshooting doesn’t have to be on the critical path of assuring application performance.
Let’s look at some examples and the autonomic way TAP handles them.
- Resource “Black Holes” – A runaway job can be anything from a process that is out of control to a poorly written query with too many nested loops. The problem is that the more resources you throw at it, the more it’s going to use. It’s a “black hole” of resource usage.
When the SDDC is controlled by TAP, the offending application will quickly run out of budget, suspend itself and notify you of the violation. In so doing, TAP not only points you to the application that requires someone’s attention but, more importantly, prevents interference with, and performance degradation of, all the other applications in the shared environment.
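The budget mechanism can be illustrated with a toy model: the more a runaway job holds, and the more congested (and thus expensive) the shared resources become, the faster it burns through its budget. The numbers, field names and pricing schedule here are invented for illustration only.

```python
# Hypothetical sketch of the budget mechanism; not the real platform.

def step(app, unit_price):
    """Charge the app for the resources it is currently holding."""
    app["budget"] -= app["held"] * unit_price
    if app["budget"] <= 0:
        app["state"] = "suspended"   # out of budget: suspend itself

runaway = {"name": "nested-loop-query", "budget": 100.0,
           "held": 0.0, "state": "running"}

# The runaway job grabs ever more resources, and congestion drives
# the unit price up each tick, accelerating its suspension.
for tick in range(10):
    runaway["held"] += 25.0                  # demand keeps growing
    step(runaway, unit_price=0.1 * (tick + 1))
    if runaway["state"] == "suspended":
        print(f"suspended at tick {tick}")
        break
```

Note how the convex feedback works in the black-hole case: a well-behaved application with flat demand pays a steady, affordable bill, while the runaway job’s expenses compound and it prices itself out of the shared environment.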
- Storage Issues – In many large organizations, virtualization admins are separate from the storage admins and may not have the necessary visibility into the storage layer. When storage problems, such as IO latency, arise, you may think that moving VMs to different storage will solve the problem. However, without understanding the SAN environment and, more importantly, the dependencies across the entire stack, the moves may cause bigger problems.
TAP has a complete topological view of all the workloads and their dependencies, from the applications all the way to the back-end storage. TAP understands the complete path IOPS take to get to the back-end storage. This understanding is what enables TAP to identify the bottlenecks and drive the right actions to prevent performance degradation. TAP controls the tradeoffs across multiple dimensions, e.g. IOPS, storage latency, storage amounts, compute and network, across the multiple layers of the IT stack, down to the Data Stores, Arrays and Controllers. By continuously analyzing these tradeoffs, TAP properly places VMs across compute, Data Stores, Arrays and Controllers to minimize interference and prevent HBA bottlenecks or storage latencies due to array and/or controller congestion. If additional actions, such as provisioning additional storage capacity, are required to control the environment in a Desired State, TAP will trigger them. Furthermore, TAP automatically discovers the storage configuration constraints as well as thin-provisioned storage to ensure these are properly considered when controlling the environment in a Desired State.
- Network Issues – We may have a good handle on compute and storage, but what about network? We may distribute our workloads across our compute and storage to minimize CPU, memory and IO congestion, but, in doing so, introduce network latencies and bottlenecks. These tradeoffs are complex and continuously changing. Given any group of workloads, should they be placed on the same host and storage? Can they be placed on different hosts within the same cluster? Can they be placed in different clusters? Can they be placed in different data centers? Can they be placed in different clouds, or different zones within a cloud? The answer is “it depends”. What are the congestion levels of the possible compute and storage where they can be placed? What are the levels of chattiness/network flow between these workloads at a given point in time? What is the available network bandwidth? If SDN is present, what are the communication policies in place, and can the switches accommodate all the required policies to support the workload placement?
As complex as it is today, the level of complexity increases dramatically with containers. In addition to all of the above questions, ask yourself, can a given group of containers be placed in the same VM? Is the VM big enough to support all the containers? If placed across multiple VMs, where should these VMs be placed? On the same host? In the same cluster? Across clusters? Across Zones? Across clouds?
TAP automatically and dynamically discovers all of the above, continuously analyzes the workload performance and the aforementioned tradeoffs, and configures and places the workloads to properly satisfy these tradeoffs, controlling the environment in a Desired State.
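The compute-versus-network tradeoff behind those placement questions can be made concrete with a toy cost model: co-locating chatty workloads eliminates cross-host traffic but raises compute congestion. The cost function, weights and names below are illustrative assumptions, not any vendor’s actual analytics.

```python
# Toy placement tradeoff: compute congestion vs. network chattiness.
from itertools import product

def placement_cost(assignment, hosts, flows, cpu):
    """Congestion cost per host, plus a network cost for every pair
    of workloads placed on different hosts."""
    cost = 0.0
    for h in hosts:
        load = sum(cpu[w] for w, host in assignment.items() if host == h)
        util = load / hosts[h]
        cost += float("inf") if util >= 1 else 1.0 / (1.0 - util)
    for (a, b), flow in flows.items():
        if assignment[a] != assignment[b]:
            cost += flow              # pay for traffic crossing hosts
    return cost

hosts = {"h1": 10.0, "h2": 10.0}      # CPU capacities
cpu = {"web": 4.0, "db": 4.0}         # CPU demands
flows = {("web", "db"): 5.0}          # chattiness between workloads

best = min(
    ({"web": a, "db": b} for a, b in product(hosts, repeat=2)),
    key=lambda asg: placement_cost(asg, hosts, flows, cpu),
)
print(best)
```

In this toy model the flow of 5.0 outweighs the congestion penalty, so the two workloads end up co-located; shrink the flow well below that penalty and the same search spreads them across hosts instead. Real environments trade off many more dimensions at once, which is exactly why the decision “depends”.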
Let Me Finish…
The industry continues to create suites of non-integrated point tools, chasing and collecting more and more detailed data. None of them solve the real problem we are facing: how to control the SDDC in a state in which application performance is assured while the environment is utilized as efficiently as possible.
How many more tools and suites will be pitched to organizations without solving the problem? This trajectory leads to management that is too complex, not scalable, and of limited value. Instead of reducing the management complexity and the operational costs, these tools only contribute to the increasing management nightmare. We must stop the trend.