Last year was the year of OpenStack. People talked about it, deployed it, tested it, and had grandiose plans for rolling it into production. Few, in my experience, succeeded on schedule, without technical challenges, or without skill-set gaps on their teams. While OpenStack is still being leveraged in many of my customer accounts, I am finding more and more people gravitating to alternatives like Azure and AWS for elastic compute and development project needs.
Enterprise shops are beginning to deploy net-new development and web-facing applications that are too bursty in nature to host on-premises. This could include retail customers running web server front ends for Black Friday, software companies rolling out PaaS services to their development teams, or even the typical enterprise that sees user traffic increase monthly. Some choose to start by migrating existing stateless applications; others begin a regimented deployment model for net-new workloads so they don’t have to deal with migration at all. In either case, IT shops usually have very broad goals on what percentage of their workload they would like to live in the cloud by a certain date. My fear is that the industry will soon run into the same challenges we ran into with OpenStack, mainly around real-time management and cost.
I am going to focus on Azure in particular because I am noticing it on the heels of AWS when it comes to adoption for performance and cost considerations (albeit most people I speak with are using some combination of AWS and Azure and are still determining which one deserves their precious budget dollars). But before going into details, it is important to understand the various Azure pricing models.
There are 3 consumption models for Azure that make this interesting:
- Pay as you go: “Only pay for what you use!” This is really a loaded statement because it isn’t about what you “use” per se; it’s about what you allocate – what size template you choose to rent, regardless of its actual consumption. We will dive into this later.
- Pre-paid Subscriptions: Good for the bulk buy if you want to pay upfront and save some money per unit.
- Enterprise Licensing Agreements: This is usually the option I hear about from some of my good engineering friends in my customer base who say: “Yeah, someone in management decided to buy an enterprise ELA and we don’t know why, but now they are asking us to get 40% of our workload there by 2018.” While it obviously gives the customer the best price point, it inevitably introduces a unique challenge for IT shops: every minute they aren’t running workloads in the cloud is money wasted… Talk about stress for the tech folks on your team! The clock is ticking…
With dynamic pricing, dynamic workloads, and an ever-changing landscape of new cloud providers promising to do it better, cheaper, and faster, the ultimate question becomes: How do we make the real-time decisions necessary to guarantee the best quality of service, the best cost, and the right size for our workloads? We know there is a cost to doing this wrong, but what is it really? When I speak with my enterprise clients moving to, or considering, Azure, I always pose the following question: “Putting Azure aside for a minute, does your team currently have a proven method for rightsizing your workloads in the on-premises environment?”
The question is either met with laughs or a rapid-fire “Of course not…”, but I have never had anyone reply: “Oh yes, we never have arguments with app owners, we have perfect data on consumption, and we have configured automation to take care of this for us automatically!” (Okay, on second thought, maybe I can think of a few of our customers doing this… but for the most part it is not the norm.)
The reason I always ask this is simple: If we don’t have a way to do this in an on-premises environment, what makes us think it is going to get any easier in the cloud? In fact, this promises to become more difficult as we deploy distributed applications consuming from different Azure regions and compute pools, all with different resource demands. The difference is that this time, missing the target can mean drastic overspend: in some cases, 70-80% more than needed.
This is a new challenge for IT shops because white space for inactive consumption in the on-premises datacenter doesn’t really present a hidden expense (you paid for the hardware, you use it, and if it’s inactive you aren’t wasting resources, save for overhead memory and CPU). In a way, it is easier because we can quantify the waste in these environments, which makes us feel good. In Azure, we simply don’t know where these hidden costs are. And that is scary.
Let’s dive into a conservative example of how we run the risk of over-spending in Azure. Assume I am an IT admin managing an environment of 5,000+ VMs on VMware infrastructure, and I am being asked to deploy a net-new environment for the development of a new software application that will go into production come 2018.
How we run the risk of over-spending in Azure
After initially scoping the project, I determine that I need 75 App Services, 75 Virtual Machines, and 50 SQL Databases (a total of 200 workloads). Doesn’t seem too bad, right? The first challenge is the initial catalog selection for my template size. I run several weeks of testing in-house and try to predict user load and traffic, but at the end of the day, I am really making a best guess. Needless to say, we end up going with “bigger” to be safer. For my App Services, I choose the standard offering and the medium-sized template with 4 cores, 7 GB RAM, and 50 GB storage for $0.40/hour in West US:
For my developers, I am expecting heavy utilization through the end of the year as deadlines come up for the project rollout. At their request (and their Director’s request), we go with the standard A4 template in Azure West with 8 cores, 14 GB of RAM, and 605 GB of storage at $0.72/hour:
For my SQL Databases, I put them in East US for redundancy in case Azure West experiences any issues. I select the premium offering and pick the P2 tier at 250 DTUs and 500 GB storage at $1.25/hour:
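Putting those three line items together, the math behind the catalog’s price tag looks like this (a quick sketch, assuming a 31-day, 744-hour billing month):

```python
# Tally the hourly, monthly, and annual cost of the example deployment
# from the per-hour rates quoted above, assuming a 744-hour (31-day) month.

HOURS_PER_MONTH = 744  # 31 days x 24 hours

deployment = [
    # (workload, count, $/hour)
    ("App Service (Standard, medium)", 75, 0.40),
    ("Virtual Machine (Standard A4)",  75, 0.72),
    ("SQL Database (Premium P2)",      50, 1.25),
]

hourly = sum(count * rate for _, count, rate in deployment)
monthly = hourly * HOURS_PER_MONTH
annual = monthly * 12

print(f"Hourly:  ${hourly:,.2f}")    # $146.50
print(f"Monthly: ${monthly:,.0f}")   # $108,996
print(f"Annual:  ${annual:,.0f}")    # $1,307,952
```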
Using the nice shopping catalog on the Azure site, I am presented with my price tag: $108,996/month for my 200-workload deployment. That equates to $1,307,952 a year! (Good thing management approved the budget and it’s not coming out of my pocket…) Now the tough part comes: managing the run-time environments in real time as the developers do what developers do best and USE the machines. Questions race through my mind come November:
- How do I really know that I am getting the right performance?
- I am seeing more AWS pockets of workload spin up. Should I consider moving some of the workloads there? Will it save me money?
- Are my workloads still sized correctly? What if I incorrectly size them down because they aren’t using anything and then the workloads pick up in December?
- How do I show management that they are getting the best bang for their buck? The reporting and metering is so confusing in Azure!
By the end of this process, I am too nervous to touch the workload out of fear of performance repercussions. I let the workloads stay at the current size throughout the rest of the year and respond to performance issues once users complain about slowness.
Anyone else seeing the problem here? I have been promised the world of elastic compute, infinite capacity, agile service delivery and deployment, and QoS guarantees, but nothing has changed in the way I manage my application workload needs. I’m still just guessing how much I need to allocate to meet demand, rather than truly consuming resources on a pay-as-you-go basis. It seems eerily similar to my on-premises environment when it comes to making the right resource allocation decisions.
Let’s look at the opportunity cost of not making these decisions dynamically in real time. For the purposes of our example, let’s assume I COULD HAVE sized each of my 200 machines down ONE template size in Azure throughout the course of the entire 12 months they ran. (Please reference the pricing tables to see the math done below.)
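Here is a sketch of that calculation. The smaller-tier rates below are illustrative assumptions (in the pricing tables above, one step down roughly halves the hourly rate: A4 → A3, P2 → P1, medium → small); substitute your own tiers to check the math:

```python
# Hypothetical annual overspend from running each workload one template
# size larger than needed for a full year. One-size-down rates are
# assumptions for illustration, not quoted Azure list prices.

HOURS_PER_YEAR = 744 * 12  # 31-day billing months, as above

workloads = [
    # (count, chosen $/hr, one-size-down $/hr -- assumed)
    (75, 0.40, 0.20),   # App Service: medium -> small
    (75, 0.72, 0.36),   # VM: A4 -> A3
    (50, 1.25, 0.625),  # SQL DB: P2 -> P1
]

overspend = sum(
    count * (chosen - smaller) * HOURS_PER_YEAR
    for count, chosen, smaller in workloads
)
print(f"Annual overspend: ${overspend:,.0f}")  # $653,976
```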
Using the math above, I just overspent nearly $700,000 on just 200 workloads by over-sizing each by a single template increment. What if each required the next level down as well? The risk here is enormous, yet most people I speak with don’t have a clear way to control it.
More importantly, the same cost applies in the other direction in the case of under-sizing our workloads, except this time it isn’t as quantifiable. Performance and efficiency are at constant odds, and the human operators stuck in the middle of this resource alignment make sacrificing one for the other much more likely. With the Rubik’s Cube constantly changing and resetting, monitoring more data and handing it to people to analyze will not be enough to make real-time decisions.
The bottom line is that we are not in the business of building datacenters and plugging in technology. We look at the wonderful things that OpenStack and the cloud have to offer, and what we really care about is our applications. Our business lives and breathes based on the success of our applications and the delivery of those services to our constituents.
I see too many people go down the cloud/open-source road for the right reasons and get stopped in their tracks because they are unable to decouple their applications from the underlying infrastructure: agnostic compute resources are simply a means to an end for servicing performance needs, not the end itself. In other words, the challenge doesn’t stop once we get there. Our success or failure relies on putting a sustainable model in place for people, technology, and resources that allows our environment to self-manage and brings people out of the resource-alignment rabbit hole that stalls our efforts.
There must be an autonomic way to solve this problem – autonomic in the way human systems self-manage to keep us healthy. This includes dynamically sizing workloads in public and hybrid cloud environments, considering the mobility of workloads across regions within the same cloud provider or across providers like AWS, and knowing when to burst resources to or from the cloud as workloads sit idle or need more resources. Most importantly, we must be able to do this in real time while weighing the ever-changing trade-offs of performance, cost, locality, and business demands.
In my next blog, I will jump into examples of customers (unnamed for privacy reasons) that have successfully implemented an Azure cloud model in this new world, and the ways they are currently trying to mitigate this risk to further a successful deployment.
Follow me on Twitter @Bannamal
Email me with comments or questions at email@example.com