To configure fault domains for high availability in Azure, you'll want to start with the Virtual Machine Scale Set.
A Virtual Machine Scale Set is a resource that allows you to manage and scale multiple virtual machines as a single entity.
The scale set is where you configure how many fault domains its virtual machines are spread across.
Spreading virtual machines across fault domains places them on separate hardware with independent power and networking, so a single hardware failure cannot take all of them down.
Azure Fault Domains
Understanding fault domains helps you configure your resources for high availability. A fault domain defines a set of Hyper-V hosts that could be affected by the same physical failure, such as the loss of a power source or a network failure.
You can also define fault domains yourself in a Hyper-V failover cluster, and it's strongly recommended to do so before enabling Storage Spaces Direct. This configuration helps you distribute your virtual machines across multiple fault domains, reducing the risk of downtime due to hardware failures.
VMs in the same fault domain share the same underlying infrastructure, power source, and network switch, so any failure in that shared infrastructure affects all VMs running in that fault domain.
By understanding and configuring Azure Fault Domains, you can create a more resilient and highly available infrastructure for your applications.
Configuring Fault Domains
Fault domains should be configured in the cluster before Storage Spaces Direct is enabled, so that virtual machines are distributed across different physical hardware from the start.
Because every VM in a fault domain depends on the same hosts, power source, and network switch, a single failure in that shared infrastructure takes down all of them.
Distributing your VMs across multiple fault domains keeps one such failure from affecting every VM.
In Azure, a fault domain is a logical grouping of resources that share common physical hardware, such as a rack with its own power source and network switch. Azure ensures that resources placed in different fault domains do not run on the same physical hardware.
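To make the effect of spreading concrete, here is a minimal Python sketch. It is a simplified model rather than Azure's or Hyper-V's placement logic, and the VM names and fault domain numbers are hypothetical. It lists which VMs keep running when one fault domain fails.

```python
def survivors_after_failure(placement, failed_fd):
    """Return the VMs that keep running when one fault domain fails.

    placement maps a VM name to the number of the fault domain it runs in.
    """
    return sorted(vm for vm, fd in placement.items() if fd != failed_fd)


# Hypothetical layouts: every VM in one fault domain vs. spread over three.
single_fd = {"vm1": 0, "vm2": 0, "vm3": 0}
spread = {"vm1": 0, "vm2": 1, "vm3": 2}

print(survivors_after_failure(single_fd, failed_fd=0))  # []
print(survivors_after_failure(spread, failed_fd=0))     # ['vm2', 'vm3']
```

With everything in one fault domain, a single failure leaves nothing running; with three fault domains, two of the three VMs survive.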
Azure Availability and Deployment
You can configure fault domains in an Azure availability set. Doing so distributes your workloads across different physical hardware, reducing the risk of downtime.
Fault domains and update domains are both logical groupings of resources that help you distribute your workloads. A fault domain groups resources that share common physical hardware, while an update domain groups VMs that can be rebooted together during planned maintenance.
In the Azure portal, you can separate the VMs in an availability set across up to three fault domains. The availability set gets five update domains by default, which you can increase to as many as 20.
Creating an Azure Availability Set
Creating an availability set is a key step in making your resources highly available and fault-tolerant. You have two options: create the availability set together with a VM, or create the availability set first and select it when you create a new VM.
Azure assigns each VM in an availability set to a fault domain and an update domain automatically, so it's essential to understand how that placement works.
Choosing the right number of update domains (UDs) and fault domains (FDs) matters. If you place more VMs in a tier than there are UDs, the additional VMs share UDs with existing VMs, starting again with the first UD. During Azure planned maintenance, a whole UD can be rebooted at once, along with every VM in it.
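The wrap-around can be sketched in a few lines of Python. This is a simplified round-robin model for illustration only: Azure does not guarantee this exact assignment order, and the function name is hypothetical. The defaults of three fault domains and five update domains come from the text above.

```python
def assign_domains(vm_count, fault_domains=3, update_domains=5):
    """Assign VMs to fault and update domains in round-robin order.

    Simplified model: VM i lands in FD (i mod fault_domains)
    and UD (i mod update_domains).
    """
    return [
        {"vm": i, "fd": i % fault_domains, "ud": i % update_domains}
        for i in range(vm_count)
    ]


placement = assign_domains(7)
for p in placement:
    print(f"VM{p['vm']}: FD {p['fd']}, UD {p['ud']}")
```

In this model the sixth VM (VM5) lands back in UD 0, so it would be rebooted together with VM0 during planned maintenance.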
Here are the default settings and limits for Azure Availability Sets:
- Fault domains: up to three
- Update domains: five by default, up to 20
By understanding how Azure Availability Sets work and configuring them correctly, you can ensure your resources are highly available and fault-tolerant, reducing the risk of downtime due to hardware or software failures or updates.
Availability Zone Support
Azure's availability zone support helps keep your applications available through a datacenter-level failure. You can create a Virtual Machine Scale Set that spans multiple zones, providing zone redundancy so that the remaining zones can keep serving if one zone fails.
Azure regions are divided into physically separate groups of datacenters known as availability zones. Each zone is designed to operate independently, and services can fail over to one of the remaining zones in case of a failure.
You can create a scale set that uses availability zones with various methods, including the Azure portal, Azure CLI, Azure PowerShell, and Azure Resource Manager templates.
A zonal deployment is a scale set that runs in a single zone, while a zone-redundant deployment spans multiple zones. With zone-redundant deployment, VMs are evenly balanced across zones by default.
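As an illustration of what "evenly balanced" means, the following Python sketch distributes a number of VMs across zones and checks the result. It is a toy model, not Azure's scheduler. The zone names and the rule that counts may differ by at most one VM are assumptions made for this example.

```python
def spread_across_zones(vm_count, zones=("1", "2", "3")):
    """Distribute vm_count VMs as evenly as possible across the given zones."""
    base, extra = divmod(vm_count, len(zones))
    # The first `extra` zones each take one additional VM.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def is_balanced(per_zone):
    """True when the VM counts of all zones differ by at most one (assumed rule)."""
    counts = per_zone.values()
    return max(counts) - min(counts) <= 1


layout = spread_across_zones(7)
print(layout)               # {'1': 3, '2': 2, '3': 2}
print(is_balanced(layout))  # True
```

A zonal deployment corresponds to passing a single zone, in which case every VM lands in that zone.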
A scale set deployed into one or more availability zones also supports several fault domain spreading options, described in the Spreading Options section.
To achieve full zone redundancy with zonal VMs, it's recommended to deploy two or more VMs across different zones.
Spreading Options
Azure's scale sets offer three spreading options to deploy your VMs across availability zones. Max spreading is the recommended deployment option, as it provides the best spreading in most cases.
To achieve max spreading, you can set the platformFaultDomainCount to 1. This will spread your VMs across as many fault domains as possible within each zone.
With max spreading, you see only one fault domain in both the scale set VM instance view and the instance metadata, regardless of how many fault domains the VMs are actually spread across. The spreading within each zone is implicit.
Static fixed spreading is another option, where you set the platformFaultDomainCount to 5. This will spread your VMs exactly across five fault domains per zone, but the request will fail if there aren't five distinct fault domains available.
You can also align the number of scale set fault domains with the number of managed disks fault domains. This can be done by setting the platformFaultDomainCount to 2 or 3, which can help prevent loss of quorum if an entire managed disks fault domain goes down.
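Here's a summary of the spreading options:
- Max spreading (platformFaultDomainCount = 1): the recommended option; VMs are spread across as many fault domains as possible within each zone.
- Static fixed spreading (platformFaultDomainCount = 5): VMs are spread across exactly five fault domains per zone; the request fails if five distinct fault domains aren't available.
- Spreading aligned with managed disks fault domains (platformFaultDomainCount = 2 or 3): the scale set fault domain count matches the managed disks fault domain count, which can help prevent loss of quorum.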
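The mapping from spreading option to platformFaultDomainCount can be sketched as a small Python helper. The option names ("max", "static", "aligned") and both function names are hypothetical, and the dictionary layout (zones at the top level, platformFaultDomainCount under properties) is an assumption modelled on a scale set template, so check it against the template schema before relying on it.

```python
def platform_fault_domain_count(option, disk_fault_domains=None):
    """Map a spreading option to a platformFaultDomainCount value."""
    if option == "max":
        return 1  # max spreading
    if option == "static":
        return 5  # static fixed spreading
    if option == "aligned":
        # Aligned with the number of managed disks fault domains.
        if disk_fault_domains not in (2, 3):
            raise ValueError("aligned spreading expects 2 or 3 disk fault domains")
        return disk_fault_domains
    raise ValueError(f"unknown spreading option: {option}")


def scale_set_settings(option, zones, disk_fault_domains=None):
    """Build the zone and fault domain part of a scale set definition."""
    count = platform_fault_domain_count(option, disk_fault_domains)
    return {
        "zones": list(zones),
        "properties": {"platformFaultDomainCount": count},
    }


print(scale_set_settings("max", ["1", "2", "3"]))
```

Rejecting any aligned value other than 2 or 3 keeps the helper consistent with the values named in the text.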
PowerShell Configuration
To configure fault domains in a failover cluster, you can use the PowerShell cmdlets from the FailoverClusters module.
First, initialize the $CIM variable with a CIM session to the cluster (New-CimSession is the standard cmdlet for creating one). Cluster-Hyv01 is the name of my cluster.
The Get-ClusterFaultDomain cmdlet gathers fault domain information. The cluster automatically creates a fault domain for each node, so the initial output lists one entry per node.
To create an additional fault domain, use the New-ClusterFaultDomain cmdlet, which takes the name of the new fault domain and its type, such as Site, Rack, or Chassis.
After creating the new fault domain, run Get-ClusterFaultDomain again to see the updated list of fault domains.
To set the parent of each fault domain, use the Set-ClusterFaultDomain cmdlet with its Parent parameter. You can then check the result in Failover Cluster Manager, where each node is shown with the fault domains it belongs to.
For example, in Failover Cluster Manager, you can see that each node belongs to Rack-22U and the site Lyon, which indicates that the fault domain configuration has been set up successfully.
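The resulting hierarchy can be pictured with a short Python model. This is only an illustration of the parent/child relationships, not a replacement for the cluster cmdlets. Rack-22U and Lyon are the names used above, while the node names and function names are hypothetical.

```python
# Child -> parent links, one entry per fault domain that has a parent.
PARENTS = {
    "Rack-22U": "Lyon",    # the rack belongs to the site
    "node-1": "Rack-22U",  # hypothetical node names
    "node-2": "Rack-22U",
}


def ancestors(name, parents):
    """Return the chain of fault domains above `name`, nearest first."""
    chain = []
    while name in parents:
        name = parents[name]
        chain.append(name)
    return chain


def members(domain, parents):
    """Return every fault domain that sits below `domain` in the hierarchy."""
    return sorted(n for n in parents if domain in ancestors(n, parents))


print(ancestors("node-1", PARENTS))  # ['Rack-22U', 'Lyon']
print(members("Rack-22U", PARENTS))  # ['node-1', 'node-2']
```

Here, members("Rack-22U", PARENTS) answers the practical question of which nodes are affected if that rack loses power.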
Sources
- https://www.tech-coffee.net/fault-domain-awareness-with-storage-spaces-direct/
- https://4sysops.com/archives/managing-azure-availability-sets/
- https://www.terminalworks.com/blog/post/2023/04/16/deploying-domain-controller-in-azure-best-practices-securing
- https://www.smikar.com/azure-fault-and-update-domains/
- https://learn.microsoft.com/en-us/azure/reliability/reliability-virtual-machine-scale-sets