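The Azure VM SLA guarantees a minimum level of VM connectivity that depends on how the VMs are deployed, not on whether the environment is Production or Dev/Test: 99.99% for two or more VMs spread across Availability Zones, and 99.95% for VMs in the same Availability Set or Dedicated Host Group. Single-instance VMs carry a lower guarantee that depends on the disk type.
The VM SLA itself covers connectivity, not data. Data durability comes from Azure Storage redundancy options, which replicate your data across multiple locations to reduce the risk of data loss in a failure.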
Azure's robust infrastructure and redundant systems make it possible to maintain such high uptime and availability. With Azure VM SLA, you can focus on building and deploying your applications without worrying about the underlying infrastructure.
In the event of a failure, Azure's automated processes and 24/7 support team ensure that your virtual machines are restored quickly, minimizing downtime and ensuring business continuity.
What Is an Azure VM SLA?
SLA stands for Service Level Agreement, which is a promise made by Microsoft to its customers regarding the uptime and availability of its services. Specifically, Microsoft guarantees Virtual Machine (VM) connectivity to at least one instance for a certain percentage of the time.
The percentage of uptime depends on the deployment configuration of the VMs. For instance, if you have two or more VMs deployed across two or more Availability Zones in the same Azure region, Microsoft guarantees 99.99% uptime.
SLAs are in place to ensure that customers get the level of service they expect. If Microsoft doesn't meet its SLA, customers can submit a claim to receive a credit on their account.
There are two types of SLAs: one for VMs deployed across Availability Zones and one for VMs deployed in the same Availability Set or Dedicated Host Group. The uptime percentage for the latter is 99.95%.
Here's a breakdown of the SLAs:
- 99.99% uptime for VMs deployed across two or more Availability Zones in the same Azure region
- 99.95% uptime for VMs deployed in the same Availability Set or Dedicated Host Group
If you're eligible to submit a claim, you must notify Microsoft's Customer Support within 5 business days of the incident. You'll also need to provide sufficient evidence to support your claim. Once the claim is validated, you'll receive a service credit of 10% if monthly uptime fell below the guaranteed percentage but stayed at or above 99%, or 25% if it fell below 99%.
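As a rough illustration of how a month's downtime maps to the credit tiers above, the sketch below computes a monthly uptime percentage and looks up the credit. The function names, the 30-day month, and the assumption that the 10% tier begins just below the guaranteed percentage are illustrative choices, not text from Microsoft's SLA.

```python
def monthly_uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime as a percentage of all minutes in the month."""
    total_minutes = days_in_month * 24 * 60
    return (total_minutes - downtime_minutes) / total_minutes * 100


def service_credit_percent(uptime_percent: float, guaranteed_percent: float = 99.95) -> int:
    """Map an uptime percentage to the 10% / 25% tiers (boundaries are assumptions)."""
    if uptime_percent >= guaranteed_percent:
        return 0   # SLA met, no credit
    if uptime_percent >= 99.0:
        return 10  # below the guarantee but at least 99%
    return 25      # below 99%


if __name__ == "__main__":
    for downtime in (10, 60, 600):
        uptime = monthly_uptime_percent(downtime)
        print(f"{downtime:>4} min down -> {uptime:.3f}% uptime, "
              f"{service_credit_percent(uptime)}% credit")
```

Passing `guaranteed_percent=99.99` applies the same logic to a deployment across Availability Zones.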
Azure VM SLA Requirements
Azure VM SLA Requirements are crucial to ensure high availability for your workloads.
To define your requirements, identify the cloud workloads that need high availability and their usage patterns. For each one, consider the percentage of uptime, Mean Time to Recovery (MTTR), Mean Time Between Failures (MTBF), Recovery Time Objective (RTO), and Recovery Point Objective (RPO).
Use a combination of these variables to define your target Service Level Agreement (SLA) for each application workload.
Microsoft defines its own SLA for each Azure service, and you can consult Azure documentation to see the guaranteed SLA for the services you are using. If you require a higher SLA than that guaranteed by Azure, you can set up redundant components with failover.
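To see why redundancy with failover can raise the achievable SLA, it helps to combine the availabilities of individual components. The sketch below uses the standard textbook formulas; it assumes component failures are independent and that failover is instantaneous and always succeeds, which real systems only approximate. The function names are hypothetical.

```python
from math import prod
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """The service is up if at least one of `copies` independent replicas is up."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # A VM tier at 99.95% that depends on a database at 99.9%.
    print(f"chain:          {serial_availability([0.9995, 0.999]):.6f}")
    # Two independent replicas of a 99.9% component behind failover.
    print(f"two replicas:   {redundant_availability(0.999, 2):.6f}")
```

Note that chaining dependent services lowers the combined figure below either individual SLA, while adding replicas raises it.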
Here are some key SLA requirements to consider:
- Percentage of Uptime: The percentage of time your workload is available and operational.
- Mean Time to Recovery (MTTR): The average time it takes to recover from a failure or outage.
- Mean Time Between Failures (MTBF): The average time between failures or outages.
- Recovery Time Objective (RTO): The maximum time allowed to recover from a failure or outage.
- Recovery Point Objective (RPO): The maximum amount of data that can be lost in the event of a failure or outage.
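These variables are related: the mean time between failures and the mean time to recovery together give an expected uptime fraction, and a recovery test either meets the RTO and RPO or it does not. The sketch below shows both checks. The function names are hypothetical, and expressing data loss in minutes of lost writes is an assumption made for the example.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected fraction of time the workload is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def meets_recovery_objectives(
    recovery_minutes: float,
    data_loss_minutes: float,
    rto_minutes: float,
    rpo_minutes: float,
) -> bool:
    """True if an observed recovery stayed within both the RTO and the RPO."""
    return recovery_minutes <= rto_minutes and data_loss_minutes <= rpo_minutes


if __name__ == "__main__":
    # One failure every 999 hours on average, one hour to recover.
    print(f"availability: {steady_state_availability(999, 1):.4%}")
    # A drill that took 45 minutes and lost 5 minutes of data, against RTO 60 / RPO 15.
    print("objectives met:", meets_recovery_objectives(45, 5, 60, 15))
```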
Azure VM SLA Planning and Architecture
When designing for a target SLA on Azure VMs, start with a Failure Mode Analysis (FMA) to identify potential failures and their implications.
Factor in cost: each redundant layer can roughly double the cost of that layer, so budget for the licenses and infrastructure (storage, networking, and bandwidth) that the additional instances need.
Avoid single points of failure and use load balancing to distribute requests between redundant components. This helps systems fail gracefully and restore operations without disrupting service.
Plan Your Architecture
To plan an architecture that meets your SLA target, start with a Failure Mode Analysis (FMA). This helps you identify the types of failure you might experience and the implications of each.
An FMA also guides you in deciding how much redundancy each component needs, so that you avoid single points of failure and use load balancing to distribute requests between redundant components.
Consider the costs of redundancy, since each redundant layer can roughly double the cost of that layer. Make sure you have the licenses and infrastructure to support the additional instances, including storage, networking, and bandwidth.
To ensure systems can fail gracefully, isolate critical resources and use compensating transactions and asynchronous operations. If one component fails, business operations can then continue, and pending work can be applied to a redundant component.
Replicate data in a way that supports your redundancy strategy and your Recovery Time Objective (RTO) and Recovery Point Objective (RPO). This enables you to fail over or recover without disrupting service.
Document every step, whether automatic or manual, needed to fail over to a redundant component and to recover, or "fail back", to the original component. Instructions should be short and clear enough to use in an emergency.
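A compensating transaction undoes the effect of a step that already completed when a later step fails. The sketch below is a minimal, in-memory illustration of that idea; the `run_with_compensation` helper and the step names are hypothetical, and a real system would also need durable state and retries.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_with_compensation(steps: List[Step]) -> bool:
    """Run each action in order; if one raises, undo completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True


if __name__ == "__main__":
    log: List[str] = []

    def charge_card() -> None:
        raise RuntimeError("payment service unavailable")

    ok = run_with_compensation([
        (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
        (charge_card, lambda: log.append("refund card")),
    ])
    print(ok, log)
```

Only the steps that actually completed are compensated, so the failed payment step is never "refunded".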
Proximity Placement Groups and Cloud Volumes
Proximity Placement Groups and Cloud Volumes can be a game-changer for mission-critical workloads, allowing you to locate compute resources within the same data center and cutting down network latencies.
Azure proximity placement groups reduce network latency between compute resources, which makes them a good fit for latency-sensitive workloads such as real-time analytics or video streaming.
By using proximity placement groups with Cloud Volumes ONTAP, you can achieve high availability with an RPO of zero and an RTO of under 60 seconds. This is made possible by Cloud Volumes ONTAP's ability to leverage native Azure Storage capabilities like snapshots, disaster recovery, encryption, and low-RPO HA configuration.
Azure proximity placement groups work by allowing users to decide where compute resources are placed within an Azure region, making it possible to locate them all within the same data center. This reduces network latencies and improves overall performance.
Availability Zones
Azure has a feature called Azure Availability Zones that allows you to deploy applications across multiple data centers, protecting applications against outages in any one Azure data center.
You can use availability sets that operate locally within an AZ, or region pairs that run applications across different geographical regions to achieve high availability.
Azure supports high availability for most of its services, including Azure VMs, SQL Database, and Azure Load Balancer.
To gain cross-zone high availability, you can leverage Azure zone-redundant storage (ZRS) with Cloud Volumes ONTAP.
Cloud Volumes ONTAP can now use ZRS to protect your data against service disruptions in Azure that go beyond the native Azure resiliency features.
Azure VM SLA Testing and Monitoring
To ensure your Azure VM SLA is reliable, you should perform end-to-end testing under realistic failure conditions. This includes testing different failure scenarios, such as a combination of failures, and measuring recovery time.
You can conduct additional tests to increase your level of confidence, including:
- Identifying failures under load by performing realistic load testing until a system fails, and observing how failure mechanisms behave.
- Running disaster recovery exercises, where systems go down and your team must quickly operate according to your disaster recovery plan.
- Testing health probes to ensure they respond correctly in case of failure.
- Periodically checking that data from monitoring systems is accurate, to ensure you can detect failure in time.
Monitoring your Azure VM SLA is also crucial. You should leverage Azure's extensive logging and auditing capabilities, including semantic and asynchronous logging, to detect potential issues before they become major problems.
Perform End-to-End Testing
End-to-end testing is a crucial step in ensuring the reliability of your Azure VM system. It involves testing the system under realistic failure conditions.
To do this, use fault injection testing to test different failure scenarios, including a combination of failures, and measure recovery time. Test both failover and failback to ensure a smooth transition.
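The core of a fault injection test is simple: break something on purpose, then poll until the system reports healthy and record how long that took. The sketch below shows that loop against a toy service with a simulated clock so it runs instantly. `SimulatedService` and `measure_recovery_time` are hypothetical names; a real test would inject the fault into actual infrastructure and poll a real health endpoint using wall-clock time.

```python
from typing import Optional


class SimulatedService:
    """Toy service that turns unhealthy on a fault and recovers after a fixed delay."""

    def __init__(self, recovery_seconds: float) -> None:
        self.recovery_seconds = recovery_seconds
        self.failed_at: Optional[float] = None

    def inject_fault(self, now: float) -> None:
        self.failed_at = now

    def is_healthy(self, now: float) -> bool:
        return self.failed_at is None or now - self.failed_at >= self.recovery_seconds


def measure_recovery_time(
    service: SimulatedService, poll_interval: float = 1.0, timeout: float = 300.0
) -> Optional[float]:
    """Inject a fault at t=0, poll until healthy; return observed seconds or None on timeout."""
    now = 0.0
    service.inject_fault(now)
    while now <= timeout:
        if service.is_healthy(now):
            return now
        now += poll_interval
    return None


if __name__ == "__main__":
    observed = measure_recovery_time(SimulatedService(recovery_seconds=12.5))
    print(f"observed recovery time: {observed} s")
```

The observed value is rounded up to the polling interval, so choose an interval that is small relative to your RTO.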
Identify failures under load by performing realistic load testing until a system fails, and observe how failure mechanisms behave. This will give you a better understanding of how your system performs under stress.
Run disaster recovery exercises to test your team's ability to quickly operate according to your disaster recovery plan. Conduct a planned or unplanned experiment where systems go down and observe how your team responds.
Test health probes to ensure they respond correctly in case of failure. The Azure load balancer uses health probes to identify component failure, so it's essential to test them regularly.
Test monitoring systems periodically to ensure that data from monitoring systems is accurate. This will help you detect failure in time and take corrective action.
In summary, these are the tests that increase your level of confidence:
- Identify failures under load
- Run disaster recovery exercises
- Test health probes
- Test monitoring systems
Detect Failure in Time with Probes and Monitoring
To detect failures in time, you should use probes and check functions to get fresh data about service availability. This is critical to high availability.
Implement Azure health probes to monitor your application's health. Run check functions from outside the application to get accurate results.
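A health probe typically marks an instance unhealthy only after several consecutive failed checks, so that a single dropped request does not trigger failover. The sketch below models that decision logic only. `ProbeTracker` and its default threshold are illustrative assumptions, not the implementation or the defaults used by Azure Load Balancer.

```python
class ProbeTracker:
    """Tracks probe results and reports health based on consecutive failures."""

    def __init__(self, unhealthy_threshold: int = 2) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self.consecutive_failures = 0

    def record(self, probe_succeeded: bool) -> bool:
        """Record one probe result; return True while the instance counts as healthy."""
        if probe_succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.unhealthy_threshold


if __name__ == "__main__":
    tracker = ProbeTracker(unhealthy_threshold=2)
    results = [True, False, True, False, False, True]
    print([tracker.record(r) for r in results])
```

In this run the isolated failure is tolerated, the second consecutive failure marks the instance unhealthy, and the next success restores it.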
Don't pay attention only to complete failure; also watch for degrading health metrics. They can provide a warning signal that a failure is about to happen.
Create an early warning system by identifying key indicators of application health. Alert operators when a system reaches a problematic threshold value.
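An early warning check can be as simple as comparing each health indicator against a warning level and a critical level. The sketch below shows one way to do that; the metric names and threshold values are made up for the example, and it assumes that higher values are worse for every metric.

```python
from typing import Dict, Tuple


def classify(value: float, warning: float, critical: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric where higher is worse."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def find_alerts(
    metrics: Dict[str, float], thresholds: Dict[str, Tuple[float, float]]
) -> Dict[str, str]:
    """Return only the metrics that have crossed a threshold, with their level."""
    alerts = {}
    for name, value in metrics.items():
        warning, critical = thresholds[name]
        level = classify(value, warning, critical)
        if level != "ok":
            alerts[name] = level
    return alerts


if __name__ == "__main__":
    thresholds = {"cpu_percent": (80, 95), "disk_queue_length": (8, 16), "error_percent": (1, 5)}
    metrics = {"cpu_percent": 85, "disk_queue_length": 2, "error_percent": 12}
    print(find_alerts(metrics, thresholds))
```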
Leverage Azure's extensive logging capabilities, including semantic and asynchronous logging. Separate application logs from audit logs to get a clear picture of what's happening.
Monitor remote call statistics like latency, throughput, and percentage of errors. This will help you identify potential issues before they cause a failure.
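The three statistics named above can be derived from a simple list of call records collected over a time window. The sketch below is one minimal way to compute them; the `CallRecord` type, the choice of the 95th percentile, and the nearest-rank percentile method are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CallRecord:
    latency_ms: float
    succeeded: bool


def summarize_calls(records: List[CallRecord], window_seconds: float) -> Dict[str, float]:
    """Compute throughput, error percentage, and p95 latency for one time window."""
    if not records:
        raise ValueError("no calls recorded in this window")
    latencies = sorted(r.latency_ms for r in records)
    failures = sum(1 for r in records if not r.succeeded)
    # Nearest-rank percentile: rank = ceil(0.95 * n), computed with integer arithmetic.
    rank = -(-95 * len(latencies) // 100)
    return {
        "throughput_per_second": len(records) / window_seconds,
        "error_percent": failures / len(records) * 100,
        "p95_latency_ms": latencies[rank - 1],
    }


if __name__ == "__main__":
    calls = [CallRecord(latency_ms=float(i), succeeded=(i != 7)) for i in range(1, 21)]
    print(summarize_calls(calls, window_seconds=10))
```

A percentile is usually more informative than the mean here, because a small number of very slow calls is often the first sign of trouble.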
Be aware of subscription limits, including storage, compute, throughput, and other quotas. Monitor the metrics that are subject to a limit and act before you exceed it.
Frequently Asked Questions
What does an SLA of 99.95 percent availability mean?
A 99.95% SLA allows for about 21 minutes and 54 seconds of downtime in an average month, or roughly 4 hours and 23 minutes per year. This translates to a highly reliable infrastructure with minimal planned or unplanned outages.
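The downtime budget implied by an SLA percentage is plain arithmetic: the total minutes in the period multiplied by the fraction of time the service is allowed to be down. The sketch below shows the calculation; the function name is hypothetical.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted in a period at the given SLA percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        per_month = allowed_downtime_minutes(sla, 30)
        per_year = allowed_downtime_minutes(sla, 365)
        print(f"{sla}%: {per_month:.1f} min per 30-day month, {per_year:.1f} min per year")
```

For 99.95%, this gives about 21.6 minutes in a 30-day month and about 262.8 minutes (roughly 4 hours 23 minutes) in a 365-day year.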
How do I check my Azure SLA?
To check your Azure SLA, go to the Azure Service Level Agreements (SLA) page and open the entry for your specific service, such as Virtual Machines or Azure Database for PostgreSQL. There you'll find the uptime guarantees and service credit terms that apply to it.
Sources
- https://bluexp.netapp.com/blog/azure-high-availability-basic-concepts-and-a-checklist
- https://www.directionsonmicrosoft.com/blog/azure-slas-what-customers-need-to-know/
- https://www.viacode.com/azure-slas-for-uptime-and-availability/
- https://k21academy.com/microsoft-azure/admin/day5-recap-2/
- https://community.sap.com/t5/enterprise-resource-planning-blogs-by-members/quick-guide-to-sap-on-azure-sla-and-ola/ba-p/13476159