Demystifying Azure capacity reservations and reserved instances
Two of my Microsoft Azure customers recently experienced production issues for the same reason: confusing Azure's "reserved instances" with "capacity reservations." It is common to assume that capacity reservation is just another term for reserved instance, but the two mean very different things, and many customers should have both...
Story time
Customer A needed to shut down 10 virtual machines to add disks and perform other maintenance. When the maintenance was complete they attempted to power the virtual machines back on, but each VM returned an error stating that Azure did not have capacity to fulfill the request. They opened a SEV-A case with Microsoft, and the assigned support engineer coordinated with Microsoft's capacity management team to get the VMs powered back on. The issue caused the customer to exceed the planned downtime for their maintenance event, which hurt their availability for the month of July.
Customer B kicked off a job in August that attempted to create a VM scale set, but it failed due to a lack of capacity in their production region. The automation then fell back to building compute in a secondary region where capacity existed, allowing the job to run to completion with no impact to the end user. While this avoided an outage it resulted in tens of thousands of dollars of egress charges due to the compute accessing storage in a remote region.
Both customers had purchased "reserved instances" and so both were surprised that capacity was unavailable at the time it was requested.
Why did this happen?
While it sounds counterintuitive, a quick read of Azure's documentation (https://learn.microsoft.com/en-us/azure/virtual-machines/capacity-reservation-overview) confirms that this was actually expected behavior:
| Differences | On-demand capacity reservation | Reserved instances |
|---|---|---|
| Term | No term commitment required. Can be created and deleted as per the customer requirement. | Fixed-term commitment of either one year or three years. |
| Billing discount | Charged at pay-as-you-go rates for the underlying VM size.* | Significant cost savings over pay-as-you-go rates. |
| Capacity SLA | Provides capacity guarantee in the specified location (region or availability zone). | Doesn't provide a capacity guarantee. Customers can choose Capacity priority to gain better access, but that option doesn't carry an SLA. |
| Region vs. availability zones | Can be deployed per region or per availability zone. | Only available at the regional level. |
*Eligible for the reserved instances discount if purchased separately.
The table above makes it pretty clear that only a capacity reservation comes with a capacity SLA, but it doesn't say why! And if reserved instances aren't the same as capacity reservations, then:
- Why might a customer choose to have a standalone capacity reservation (without a matching reserved instance)?
- Why would a customer ever choose to have a standalone reserved instance (without a matching capacity reservation)?
- When should a customer choose to have both?
- Is there any disincentive to have both?
The short answer: a reserved instance is a finance construct and a capacity reservation is an operations construct. They target different customer roles, typically performed by different people who rarely interact with each other. With that context in mind, the long answer to the questions above is laid out below...
A quick vocabulary lesson
I'll be the first to admit that these two terms sound interchangeable and I'll be the first to advocate that Microsoft should rename at least one of these things to avoid confusion.
- Reserved instances are about Microsoft granting discounts in exchange for the customer committing to pay for resources for a one-year or three-year term (irrespective of whether those resources are ever actually consumed). These help Microsoft better forecast hardware demand in Azure datacenters as part of its supply chain management, and help customers avoid pay-as-you-go rates for a portion of their environment.
- Capacity reservations are about customers securing capacity (irrespective of cost) to ensure mission-critical workloads aren't deprived of resources when needed.
Imagine the following scenario:
A large Fortune 500 customer has tens of thousands of workloads, hundreds of distinct engineering teams, dozens of operations teams and who-knows-how-many applications. The finance team sets a cost savings target of 20% for their Azure spend.
It isn't reasonable to expect the finance team to know which Azure workloads must run 24x7x365, nor is it realistic that the finance team could discover this information easily or quickly. Were Microsoft to make discounts an implicit capacity reservation, customer finance teams would be forced to spend months identifying which specific workloads were mission-critical as a prerequisite to realizing savings. So instead Microsoft has separated the concepts into different workflows, subject to different permissions, so that operations and finance professionals can stay out of each other's hair.
The case for standalone reserved instances
As customer environments grow larger, so too does the probability that, at the fleet level, the customer will continuously run a VM of a certain SKU. It might not be the same VM today and tomorrow, and it might not serve the same application today and tomorrow, but on both days the customer will be running an E64d_v4 (for example) somewhere within their fleet. Extrapolated to a customer the size of General Electric, it's easy enough to look at the last year of billing and project that, for the next year, there will always be at least 200 E64d_v4 VMs running.
In this scenario savings can be attained by reserving 200 instances without knowing which workloads will consume them. The reservation scope can apply broadly across subscription/application boundaries, and as long as at least 200 E64d_v4 VMs are running in the reservation's region and within its scope, the customer will realize the full discount.
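The fleet-level reasoning above can be sketched in a few lines of Python. This is a minimal illustration rather than a sizing tool: the hourly counts, the pay-as-you-go rate and the 25% discount are made-up numbers, and the function names are hypothetical.

```python
def baseline_count(hourly_running_counts):
    """Return the number of VMs of one SKU that were running in every hour."""
    return min(hourly_running_counts, default=0)


def utilization(hourly_running_counts, reserved_qty):
    """Fraction of reserved VM-hours that a running VM actually consumed."""
    if reserved_qty == 0 or not hourly_running_counts:
        return 0.0
    used = sum(min(count, reserved_qty) for count in hourly_running_counts)
    return used / (reserved_qty * len(hourly_running_counts))


def estimated_annual_savings(reserved_qty, payg_hourly_rate, discount, hours=8760):
    """Savings versus pay-as-you-go, assuming the reservation is fully utilized."""
    return reserved_qty * payg_hourly_rate * discount * hours


# Hypothetical fleet-wide counts of running E64d_v4 VMs, one entry per hour.
counts = [210, 205, 200, 230]
qty = baseline_count(counts)
print(qty, utilization(counts, qty), estimated_annual_savings(qty, 1.0, 0.25))
```

Reserving the fleet-wide minimum keeps utilization at 100%. Reserving more than the minimum can still pay off, but only while the discount outweighs the reserved hours that go unused.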
The case for a standalone capacity reservation
Suppose a customer wants to execute a week-long disaster recovery failover drill, temporarily moving production from East US to West US, and suppose instead of maintaining an active or passive DR environment they rely on automation to deploy the DR environment on demand. In order to ensure 100% uptime the customer needs to make sure there is actually capacity available in West US before they start the drill and they need to also make sure there is capacity in East US at the end of the week when they fail back.
Or suppose a customer has a scheduled performance test in their QA environment that must be completed by the end of June. They don't typically need a ton of compute in QA and so, as a cost optimization, they only temporarily scale up/out QA for the duration of a performance test and then scale back down/in when the test is done.
Or suppose a US retail customer traditionally has their biggest sales volume between Thanksgiving and Christmas, as well as a change freeze that lasts from the week before Thanksgiving until the first business day of the new year.
These are all scenarios in which a capacity reservation is appropriate whether or not there is a reserved instance discount; the objective here is not to save money but, rather, to ensure uptime and performance during critical periods.
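Because a standalone capacity reservation is charged at pay-as-you-go rates, the price of holding one for a fixed window is simple arithmetic. The sketch below assumes the full reserved quantity is billed for every hour the reservation exists, whether or not VMs are running in it. The quantity, hourly rate, dates and function name are all hypothetical.

```python
from datetime import datetime


def reservation_hold_cost(quantity, payg_hourly_rate, start, end):
    """Cost of holding `quantity` reserved VMs from `start` to `end`."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    hours = (end - start).total_seconds() / 3600
    # Assumption: every reserved VM is billed for every hour, used or not.
    return quantity * payg_hourly_rate * hours


# Hypothetical week-long DR drill: 50 VMs at an assumed rate of 2.00 per hour.
drill_cost = reservation_hold_cost(
    quantity=50,
    payg_hourly_rate=2.0,
    start=datetime(2024, 6, 3),
    end=datetime(2024, 6, 10),
)
print(drill_cost)
```

Since no term commitment is required, the reservation can be created shortly before the critical window and deleted right after it, which bounds the cost to the window itself.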
The case for combining reserved instances and capacity reservations
The most easily understood example would be a mission-critical workload (e.g., an accounting, HR or manufacturing system) whose minimum compute requirements are well understood and static over the course of a year. The application can scale out in excess of those minimums whenever and for however long it wants, but it never drops below them. The application can use automation to outright replace one VM with another (instead of patching), but a VM of a given SKU will always exist. For that baseline, a reserved instance supplies the discount and a capacity reservation supplies the capacity guarantee.
However in reality the most common reason customers have both capacity reservations and reserved instances is actually happenstantial; with no coordination whatsoever, a finance team will purchase reserved instances to achieve cost savings and an operations team will create a capacity reservation to ensure availability of compute. There is no intent that the same VM benefits from both a capacity reservation and a reserved instance discount, and instead it occurs simply as a byproduct of both operations and finance departments doing their jobs correctly. The apparent synergy between the two is superficial.
The case against combining reserved instances and capacity reservations
Microsoft itself hasn't implemented any tooling or process that would discourage customers from creating both a reserved instance and a capacity reservation for the same VM. In fact, the documentation linked above clearly states that if a customer has both, the capacity reservation is eligible to be billed at the discounted reserved instance rate.
Really the only reason a customer might not want to apply both boils down to how the customer operates. Less mature customers might attempt to create internal rules that force any team holding a long-term capacity reservation to coordinate with finance to secure a reserved instance, for example. Such processes can be extremely burdensome, creating friction that slows down IT, increases its costs and decreases employee satisfaction.
As long as operations is able to create capacity reservations independent from finance purchasing reserved instance discounts there really is no reason a customer should not pursue both in parallel.
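The four cases discussed in this article reduce to two independent questions, one for finance and one for operations. The helper below is only a compact restatement of that reasoning; its name, parameters and return strings are hypothetical.

```python
def recommend(steady_baseline: bool, needs_guaranteed_capacity: bool) -> str:
    """Map the finance question and the operations question to a purchase."""
    if steady_baseline and needs_guaranteed_capacity:
        return "reserved instance + capacity reservation"
    if steady_baseline:
        # Finance: the SKU runs continuously somewhere in the fleet.
        return "reserved instance"
    if needs_guaranteed_capacity:
        # Operations: capacity must be there during a critical window.
        return "capacity reservation"
    return "pay-as-you-go"


for steady in (True, False):
    for guaranteed in (True, False):
        print(steady, guaranteed, "->", recommend(steady, guaranteed))
```

Neither input depends on the other, which is why each team can answer its own question without waiting on the other.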
