Recently, there was a major outage in the AP-SOUTHEAST-2 (Sydney) Region. What we learned from this is that the weather is out of our control. Terrible storms knocked out power across large parts of the area, and a single Availability Zone (AZ) was down for an extended period of around 5 hours.
Does this mean that Amazon has failed? Not really. AWS provides resiliency across the globe by providing multiple regions, and within each region, multiple Availability Zones. This means that the overall AWS infrastructure is able to withstand significant disruption without truly becoming unavailable.
Many businesses weren’t ready for what was about to happen on that fateful day. But, just like the classic Smokey the Bear poster that exclaimed “Only You Can Prevent Forest Fires”, the fault mostly rests on us as consumers of AWS when we lose access to systems hosted by AWS.
Resiliency is Our Responsibility
As stated on the AWS compliance site, the security model for AWS and its customers is what is called the Shared Responsibility Model, which clearly delineates infrastructure from applications.
The same holds true for infrastructure availability. AWS assures infrastructure availability in the sense that it offers a globally available, resilient set of infrastructure services. Some services are natively resilient to the loss of a single AZ (e.g. VPC, Route 53, S3, DynamoDB), which leaves the core compute components on EC2 as the place where we can run into a challenge.
In the Sydney outage, we saw an entire AZ go away. So, what’s the solution to this? Customers are required to architect their solutions to withstand outages in the places where AWS specifically states they can occur. EC2 can definitely go away. If you host EC2 infrastructure, you even get outage notices for patches to the underlying hosts that will reboot your instances.
What is clear from the warning emails about planned outages is the assumption that you have alternate resources able to take on the application load and functionality.
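Those notices can also be read programmatically rather than only from email. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the function names and the default region are illustrative choices, not anything prescribed by AWS. It lists instances that still have an open scheduled event, using the response shape of the EC2 `describe_instance_status` call.

```python
from typing import Any, Dict, List, Tuple


def pending_events(statuses: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (instance_id, availability_zone, event_code) for open events.

    `statuses` is the "InstanceStatuses" list from describe_instance_status.
    Events whose description starts with "[Completed]" are skipped, since
    EC2 keeps reporting finished events for a while.
    """
    found = []
    for status in statuses:
        for event in status.get("Events", []):
            if event.get("Description", "").startswith("[Completed]"):
                continue
            found.append(
                (status["InstanceId"], status["AvailabilityZone"], event["Code"])
            )
    return found


def fetch_statuses(region: str = "ap-southeast-2") -> List[Dict[str, Any]]:
    """Fetch instance statuses for one region (assumes configured credentials)."""
    import boto3  # imported lazily so pending_events() works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instance_status")
    statuses: List[Dict[str, Any]] = []
    # IncludeAllInstances=True also returns instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        statuses.extend(page["InstanceStatuses"])
    return statuses
```

Calling `pending_events(fetch_statuses())` gives a quick list of instances that AWS plans to reboot, stop, or retire, which is a reasonable starting point for checking whether something else can take over their load.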
I don’t often subscribe to the traditional best practices, but when it comes to architecting for resiliency, AWS gives you the clear answers:
- Span multiple Regions and Availability Zones using multiple EC2 instances and Auto Scaling
- Use ELB (Elastic Load Balancing)
- Use S3 with replication for data storage
- Use SQS (Simple Queue Service) for message queuing
- Use Route 53 for DNS
- And so on...
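The first item on that list is easy to verify against what is actually deployed. Below is a rough sketch, assuming input shaped like the `Reservations` list returned by the EC2 `describe_instances` call; the helper names and the `required` threshold are hypothetical and only illustrate the idea. It counts running instances per AZ and asks whether enough capacity would remain if any one AZ disappeared.

```python
from collections import Counter
from typing import Any, Dict, List


def az_distribution(reservations: List[Dict[str, Any]]) -> Counter:
    """Count running instances per Availability Zone."""
    counts: Counter = Counter()
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            if instance.get("State", {}).get("Name") == "running":
                counts[instance["Placement"]["AvailabilityZone"]] += 1
    return counts


def survives_az_loss(counts: Counter, required: int) -> bool:
    """True if at least `required` instances remain after losing any one AZ."""
    if not counts:
        return False
    total = sum(counts.values())
    # Check the remaining capacity for each possible single-AZ failure.
    return all(total - lost >= required for lost in counts.values())
```

A fleet of four instances in one AZ fails this check for any `required` value of one or more, while two instances in each of two AZs passes as long as two instances are enough to carry the load.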
There are many more ways to embrace resiliency at all layers of the application stack. You can see how most of the solutions will lead you to buying a whole lot more AWS services. That’s by design. There can be no doubt that they give you the tools for your toolkit, and it is your choice how you define resiliency when you deploy.
AWS Isn’t Without Fault
Many folks recall small-scale, and sometimes larger-scale, outages that have hit the public cloud juggernaut. One thing that you can be sure of is maximum transparency when one does occur. There are lessons that have been learned from these outages that should affect our systems architecture.
Follow the best practice for any AWS environment, which is to “architect for failure”. It’s bound to happen. It happened on-premises too, I’m sure, but the difference is that The Register wasn’t there to cover it when your racks lost power.
The moral of the story is that the toolkit is there to build something resilient. Don’t depend on “the cloud” to save the day. We have the full responsibility to deploy resilient solutions. Odds are that many of the companies that suffered from the outage were running a single-data-center application or a similarly designed on-premises environment before, or in addition to, the public cloud infrastructure.
In other words, just like Smokey the Bear says: “Only You Can Prevent Application Outages”.
Image source: Smokey The Bear Meme Generator - https://imgflip.com/memegenerator/16716795/Smokey-The-Bear