At the moment, nearly every sysadmin and security admin in the world is looking for answers about the real impact of Meltdown/Spectre – and finding even more questions about how their systems will be affected and about the potential for exploits that have already occurred undetected.
One thing is certain: the impact on application performance is still unclear, creating uncertainty and risk for the vast majority of IT organizations.
To manage this risk, organizations must be ready to measure the impact on applications and adjust the resources they require after the patch is applied. The remediation process also highlights the critical need for systems thinking and automation: patching and managing environments at any scale is untenable without centralized systems management. This proactive approach will mitigate the risk to the services these organizations provide to end users.
But I’m getting ahead of myself!
First, let’s take a closer look at both the security and performance impact…
These vulnerabilities were initially discovered and reported in June 2017 by the Project Zero team at Google, and the affected environments include Intel, AMD, and ARM processors. (A full disclosure of the vulnerabilities and examples of exploits are available at the Project Zero blog, which goes deep into how they work.)
The best-case scenario (to date) is that additional boundaries will need to be built into operating systems and virtualization stacks to prevent user processes from sharing kernel memory spaces. The net result is slower access from these operating environments to the underlying physical compute layer. As remediation of the potential exploits progresses, more will be done to reduce the impact.
The reality is that the exploit potential will very likely outweigh the negative impact of patching. Alternative mitigation strategies are thin, given the reach of these vulnerabilities and the risk of data exposure or loss for any organization hit by an exploit.
As a result of the proposed “fix” for these vulnerabilities, there is high potential for an impact on performance. TechCrunch covered this, writing: “The Meltdown fix may reduce the performance of Intel chips by as little as 5 percent or as much as 30 — but there will be some hit. Whatever it is, it’s better than the alternative.”
Mitigating the Potential Impact of Meltdown and Spectre Patching
IT admins tend to react immediately when performance could be measurably reduced, or when a performance-related disruption of service occurs. So how can this play out? Consider these two examples:
- Out of 1,000 IT environments, it’s expected that 100% will be impacted by Meltdown or Spectre, and applying the patches, once available, may reduce performance
- Out of 1,000 IT environments, nearly 100% are suffering from workload performance issues, and continuous software-driven optimization will increase performance
Reacting to a measurable performance reduction is a no-brainer, right? What I find interesting is that the second example, which represents the same measurable performance delta, can still be a harder sell in some cases, because it is not triggered by a compelling event such as the Meltdown or Spectre exploit mitigation.
Turbonomic delivers a measurable increase in workload performance while simultaneously increasing the efficiency with which the environment is utilized. Given that a known performance reduction is very likely in many organizations’ environments, the value Turbonomic can deliver is amplified. For instance, the Turbonomic platform allows running scenarios to assess which parts of the infrastructure may be impacted the most by the patch performance cost BEFORE that cost is felt. A few planning scenarios are worth considering:
- When applying the patches, there will be disruption to host and guest availability. Run plans in Turbonomic to determine whether a given cluster has sufficient capacity to accommodate the worst-case patch performance cost while maintaining compliance with High Availability and host maintenance requirements. Adjust the plan demand up by 30% and run a plan to see if more hosts are required.
- Having identified clusters at risk, run a plan to see whether inefficiencies can be unlocked through cross-cluster migrations. Scope a Turbonomic plan to two or more clusters, relax constraints, and see the effect of redistributing workload across a wider pool of infrastructure liquidity. Trade the technical effort of implementing cross-cluster migrations against the cost of adding more hardware to hot clusters to absorb the performance cost. Granted, some workloads will never be able to migrate due to licensing and other considerations, but this is a chance to re-examine some of those decisions and the cost of segregating workloads and locking in inefficiencies.
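Under the hood, the capacity questions in the two plans above boil down to simple arithmetic. The sketch below illustrates that logic with a hypothetical headroom check; the function name, the 25% HA reserve, and all of the GHz figures are illustrative assumptions, not output from any real Turbonomic plan:

```python
# Hypothetical sketch of the two planning checks described above.
# All names and numbers are illustrative assumptions.

def has_headroom(clusters, ha_reserve_pct=25, patch_overhead_pct=30):
    """clusters: list of (capacity_ghz, demand_ghz) tuples.

    Inflate total demand by the worst-case patch overhead and compare
    it against pooled capacity minus the High Availability reserve."""
    total_capacity = sum(capacity for capacity, _ in clusters)
    total_demand = sum(demand for _, demand in clusters)
    inflated_demand = total_demand * (1 + patch_overhead_pct / 100)
    usable_capacity = total_capacity * (1 - ha_reserve_pct / 100)
    return inflated_demand <= usable_capacity

# Plan 1: a single hot cluster (400 GHz capacity, 250 GHz demand)
# cannot absorb a 30% overhead on top of a 25% HA reserve.
print(has_headroom([(400, 250)]))              # False: 325 GHz > 300 GHz usable

# Plan 2: relaxing constraints and pooling with a lightly loaded
# neighbor (400 GHz capacity, 100 GHz demand) avoids buying hosts.
print(has_headroom([(400, 250), (400, 100)]))  # True: 455 GHz <= 600 GHz usable
```

The same shape of check, run per cluster first and then across a relaxed multi-cluster scope, shows whether cross-cluster migration can substitute for buying new hardware.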
We are monitoring the issue closely to help customers navigate the road ahead… stay tuned for best practices for evaluating the potential impact in Meltdown/Spectre-affected environments, and how Turbonomic can help!