We are continuously confronted with challenges in designing infrastructure. Every aspect of our infrastructure has potential pitfalls for performance, resiliency, and the handling of dynamic workloads.
Software and hardware vendors often produce reference architecture documents and the classically titled "Best Practices" recommendations to go along with their offerings. While these are meaningful to a degree, we also have to be acutely aware of the dangers of supposed best practices. The reason is that the designs are subjective. Let's explore why this is the case.
Today's Best Practices are Tomorrow's "What were we thinking?"
"I have to say that in 1981, making those decisions, I felt like I was providing enough freedom for 10 years. That is, a move from 64 K to 640 K felt like something that would last a great deal of time. Well, it didn't - it took about only 6 years before people started to see that as a real problem." - Bill Gates
We may laugh now at the classic misquote attributed to Bill Gates that "640K ought to be enough", but it carries some very important context. As noted above, the original quote, which may have triggered the misquote, was about the effective upper limit of memory for PC hardware at the time. Much has changed since then, of course, and this is the most important thing to be wary of when attaching a Best Practice to a design. Without understanding the constraints that produced the upper boundary, we will inevitably fail to adjust when the constraints are lifted and the boundaries change.
The sports analogy to this is "moving the goalposts", another popular phrase. It's the idea that every time you get closer to the goal, the variables change and the boundary shifts, which invalidates the original goal.
It isn't that the goal was incorrect. The other variables have changed and lessened the value of the goal. In hardware terms, imagine still using 640 KB of RAM to run a modern Windows desktop. You can probably guess how that will perform.
Resiliency is Malleability
On the concept of resiliency, we design for potential failures throughout the system. We can do this by providing multiple paths to a system via the network, by mirroring data to another location, or by creating clustered services to back our applications.
Each of these resiliency strategies protects against a different point of failure in the overall system. This is also a challenge.
Imagine a scenario in the earlier days of distributed computing where a production system would have a twin server located next to it as a hardware hot spare in case of failure. That may seem bizarre in today's data center, but at the time it was the best strategy to reduce the time needed to replace failed hardware.
I say that resiliency is malleability because it is shaped by a number of variables that depend on the design. These variables will change, and the Best Practices have to change along with them.
Printed Documentation: the Path to Legacy
Printing isn't the issue. Retaining documents in printed format is. When we print a server build document, or an application deployment document, we freeze the state of that build process in time. Two weeks later someone may discover that line item 28, which says to plug the network cables into two switches in the same rack, is a terrible idea because the network team is now using the same uplink paths to the core and we have a single point of failure.
The electronic documentation is updated, but the printed doc is no longer what it claims to be with its 45-point font boldly stating "Best Practices" on the cover page. If you don't believe that this occurs, I would encourage you to ask your sysadmin for the build documentation, and you may find that they dig through a pile of paper to find it.
The moment a document is printed, it's static. Designs, constraints, and workloads are dynamic. You can easily see where this creates a problem.
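The cabling example in this section is the kind of assumption that a small, repeatable check can keep honest as the network changes. Here is a minimal sketch in Python, assuming a hypothetical inventory that maps each switch to the uplink paths it uses toward the core. The switch names, uplink names, and the `single_points_of_failure` function are made up for illustration and don't come from any real tool.

```python
def single_points_of_failure(server_switches, switch_uplinks):
    """Return uplinks whose loss would cut every network path from the server.

    server_switches: switches the server is cabled to (hypothetical names).
    switch_uplinks: mapping of switch name -> uplink paths toward the core.
    """
    candidates = set()
    for switch in server_switches:
        candidates.update(switch_uplinks.get(switch, ()))

    failures = []
    for uplink in sorted(candidates):
        # The uplink is a single point of failure if no switch keeps
        # another path to the core once this uplink is gone.
        if all(not (set(switch_uplinks.get(switch, ())) - {uplink})
               for switch in server_switches):
            failures.append(uplink)
    return failures


# Hypothetical inventory: both switches in the rack share one uplink path.
shared = {"tor-a": ["uplink-1"], "tor-b": ["uplink-1"]}
# After a redesign, each switch has its own path to the core.
separate = {"tor-a": ["uplink-1"], "tor-b": ["uplink-2"]}

print(single_points_of_failure(["tor-a", "tor-b"], shared))
print(single_points_of_failure(["tor-a", "tor-b"], separate))
```

A check like this lives alongside the electronic documentation and can be rerun whenever the uplink design changes, which is exactly what a printed page can't do.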
Software Upgrades and Best Practices
Do you revisit all of your best practices for applications and infrastructure as each software update occurs? I'm going to go out on a limb and say that about as many people revisit their best practice designs when updating an environment as read the entire End User License Agreement before clicking "I Agree".
Even teams that make use of configuration management solutions and other orchestration tools can find themselves caught off guard by changes to the feature set in a new piece of software, or in one that received an incremental update.
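One way to make that revisit harder to skip is to record which software version each recommendation was last reviewed against, and flag it when the installed version moves past that. Here is a minimal sketch in Python, assuming simple dotted numeric version strings. The `Practice` record, the component names, and the version numbers are hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Practice:
    name: str
    component: str          # software the recommendation applies to
    validated_version: str  # version it was last reviewed against


def parse_version(text):
    """Turn '2.10.1' into (2, 10, 1); assumes numeric dotted versions only."""
    return tuple(int(part) for part in text.split("."))


def practices_needing_review(practices, installed):
    """Return names of practices whose component was upgraded since review."""
    stale = []
    for practice in practices:
        current = installed.get(practice.component)
        if current is None:
            continue  # component is not deployed here, nothing to compare
        if parse_version(current) > parse_version(practice.validated_version):
            stale.append(practice.name)
    return stale


# Hypothetical catalog and installed versions.
catalog = [
    Practice("two NICs per host", "hypervisor", "6.5"),
    Practice("fixed heap size", "appserver", "2.9"),
]
installed_versions = {"hypervisor": "6.5", "appserver": "2.10"}

print(practices_needing_review(catalog, installed_versions))
```

The comparison is done on parsed tuples instead of raw strings so that "2.10" is correctly treated as newer than "2.9".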
Schrödinger's Best Practices
At any moment in time, the state of your infrastructure is both known and unknown. Much like the classic conundrum of Schrödinger's Cat, the Best Practice is both failing and winning until you actually look.
As you may already know, the core of what we do at Turbonomic is built around the concept of a Desired State, which is dynamic and continuously changing based on the shift of supply and demand of resources. The key part of the message is that a Best Practice is like the Best Athlete. If we keep using the same one, the rest of the environment will change around us and leave us behind.
No matter how good we were at a given point in time, the environment is constantly changing, and the Best Practices change along with it. This is why we have to change the way we think in the modern data center.
Best Practice guidelines are like world records: meant to be broken by something better.