Anyone who operates highly scalable infrastructure will know that there is one maxim that they must abide by:
Assume everything fails
It may seem like a rather morbid approach to the design and operation of infrastructure. It becomes obvious, though, when you shift from an "aim for 100% uptime" mindset to the Site Reliability Engineer (SRE) approach.
What I mean by the shift in approach is that we stop building on the architect's classic assumption that we can design for total reliability. The SRE approach instead expects every layer of your application infrastructure to be failing and recovering continuously, and relies on its ability to do so.
Scale-Out Fails By Design
I once read something from Kelly Sommers (@kellabyte on Twitter) about how she operated database infrastructure at such a scale that about 10% of the nodes were in a failed state at any given time due to the load put on them and other operational impact.
Anyone who is designing a system for applications at scale will understand this. More importantly, they will accept it as part of the deal. The trick for the traditional IT architect is to adapt to this new concept that failure is built into the service.
Richter Scale Cost Challenges
You are probably familiar with the Richter Scale, which is used to measure seismic events. The interesting thing about the mathematics of the Richter Scale is that it is logarithmic: each whole-number step on the scale is a multiple of the previous step rather than a linear rise of 1. In energy terms, each step up represents roughly 32 times more energy released.
For example, a 5.0 earthquake releases about 2 terajoules, a 6.0 about 63 terajoules, and a 7.0 about 2 petajoules. The reason this is important is that the same kind of multiplicative growth applies to the costs (both effort and capital) of architecting extremely highly available IT infrastructure.
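The energy figures quoted here can be reproduced with a short sketch. It assumes the commonly used approximation log10(E) = 1.5 * M + 4.8, with E in joules. That formula and the function names are not from the original text, so treat the output as ballpark values only.

```python
def richter_energy_joules(magnitude: float) -> float:
    """Approximate energy released, assuming log10(E) = 1.5 * M + 4.8 (E in joules)."""
    return 10 ** (1.5 * magnitude + 4.8)


def energy_ratio(lower: float, higher: float) -> float:
    """How many times more energy the higher magnitude releases than the lower one."""
    return richter_energy_joules(higher) / richter_energy_joules(lower)


if __name__ == "__main__":
    for magnitude in (5.0, 6.0, 7.0):
        terajoules = richter_energy_joules(magnitude) / 1e12
        print(f"Magnitude {magnitude}: about {terajoules:,.0f} TJ")
    print(f"Each whole step multiplies energy by about {energy_ratio(6.0, 7.0):.1f}")
```

The point of the sketch is the ratio rather than the absolute numbers: a single step on the scale is a multiplier, not an increment.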
The cost of operating a 99.999 percent availability solution could be much more than double that of the 99.99 percent alternative. Those costs and operational challenges fall even further as we relax the target to 99.9 and 99 percent. So, how do we change our ways to reduce the cost?
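To make the gap between those targets concrete, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. It assumes a 365-day year, and the function name is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:,.1f} minutes of downtime per year")
```

Each extra 9 cuts the downtime budget by a factor of ten, from days at 99 percent to roughly five minutes at 99.999 percent, which is why each step tends to cost so much more than the last.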
Fail by Design
We aren't suggesting that you can comfortably have 10 percent of your infrastructure failing or in distress at all times. But if you think about what it takes to design and deploy applications on subcomponents that are only 99 percent available, you begin to gain the advantage of having a system that can survive failures throughout the structure.
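A rough way to see why modest subcomponents can still add up to a resilient system is to model redundant replicas. The sketch below assumes replica failures are fully independent and that the service is up whenever at least one replica is up. Both assumptions are optimistic and neither comes from the original text; the function name is hypothetical.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of a service that needs only one of N independent replicas to be up."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    probability_all_down = (1 - component_availability) ** replicas
    return 1 - probability_all_down


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        availability = combined_availability(0.99, replicas)
        print(f"{replicas} replica(s) at 99%: {availability * 100:.4f}% combined")
```

Real failures are often correlated (shared power, network, or software bugs), so actual numbers will be lower. The sketch only illustrates why scale-out designs tolerate individually unreliable parts.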
This is a shift in the practice. It's where we begin to think in terms of cloud-native architecture, which assumes that components are loosely coupled and that resiliency is built in through a scale-out methodology.
While the upfront costs to us may seem higher now, this is true technical debt reduction. We are acquiring technical debt at rapid rates today because we often don't see the value in stepping back and attacking the issue as an SRE.
In the end, you're paying for those extra 9s many times over compared to the cost of tackling the technical debt now. That's where the Richter scale becomes a rather frightening reminder of the true cost of designing resiliency at the wrong layers.