Azure Central US Outage Causes and Prevention

Azure Central US outages can be a real headache, especially for businesses that rely on the platform.

Common causes of outages in a cloud region include network connectivity problems and software bugs, which can crash systems and cause downtime. The July 2024 Central US incident covered later in this article was triggered by a configuration update.

Customers cannot patch the Azure platform itself, but keeping their own workloads updated and patched removes one source of failures.

Monitoring system performance and responding quickly to issues also helps limit the impact of an outage.

Azure provides monitoring tools, such as Azure Monitor and Azure Service Health, that can help detect potential issues before they become major problems.
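The idea of detecting issues early can be illustrated with a small health monitor. This is a minimal sketch, not an Azure API: the `HealthMonitor` class, the `probe` callable, and the threshold of three consecutive failures are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthMonitor:
    """Tracks consecutive probe failures and signals an alert at a threshold."""

    probe: Callable[[], bool]     # hypothetical check; returns True when healthy
    failure_threshold: int = 3    # assumed value; tune per workload
    consecutive_failures: int = 0

    def check(self) -> bool:
        """Run one probe and return True when an alert should fire."""
        try:
            healthy = self.probe()
        except Exception:
            healthy = False       # a probe that raises counts as a failure
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

Requiring several consecutive failures before alerting is a common way to avoid paging on a single transient error, at the cost of slightly slower detection.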

Azure Regions

Central US is one of many Azure regions, and regions serve multiple purposes.

Azure regions provide geographic distribution, which is essential for businesses with global operations. This allows them to deploy services in multiple regions to ensure availability and resilience.

Azure regions also offer data residency and compliance, which is critical for meeting regulatory requirements. This is especially important for businesses that handle sensitive data.

Disaster recovery and business continuity are also key benefits of Azure regions. By deploying services in multiple regions, businesses can ensure that their operations remain available even in the event of a disaster.

Deploying across regions also supports high availability and fault tolerance: an outage in one region is less likely to take down the whole service.

Azure regions also offer scalability and load balancing, which is essential for businesses that experience sudden spikes in traffic. This allows them to quickly scale up to meet demand without affecting performance.

Here are the main uses of Azure regions:

  • Geographic Distribution
  • Data Residency and Compliance
  • Disaster Recovery and Business Continuity
  • High Availability and Fault Tolerance
  • Service Selection and Feature Availability
  • Scalability and Load Balancing

High Availability

High availability is a top priority for any business running workloads in Central US. Azure Availability Zones is a high-availability offering that protects your applications and data from datacenter failures.

With Availability Zones, you can replicate your applications and data across multiple physical locations within a region, reducing the risk of data loss from a single point of failure.

When two or more VM instances are deployed across two or more Availability Zones in the same region, Azure offers a 99.99% VM uptime SLA, making it an attractive option for businesses that require high uptime.
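To put 99.99% in perspective, it helps to convert the percentage into a downtime budget. The helper below is a simple arithmetic sketch; the function name and the 30-day month are assumptions, and the SLA's own measurement rules may differ.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, for a given uptime percentage."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (1 - sla_percent / 100)


# A 99.99% target leaves roughly 4.32 minutes of downtime in a 30-day month.
budget = allowed_downtime_minutes(99.99)
```

By the same arithmetic, a 99.9% target allows about 43 minutes per month, which shows how much each extra "nine" tightens the budget.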

Not every region supports Availability Zones, but Central US is among those that do.

Availability Sets are another way to achieve high availability in Central US. They provide redundancy for your virtual machines by spreading them across multiple hardware nodes.

Because the VMs sit on different hardware nodes, a hardware or software failure affects only a subset of your virtual machines.
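A toy model makes the "only a subset is impacted" point concrete. Azure performs this placement itself; the round-robin assignment, the function names, and the count of two fault domains below are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def assign_fault_domains(vm_names: List[str], fault_domain_count: int = 2) -> Dict[int, List[str]]:
    """Spread VMs round-robin across a number of (hypothetical) fault domains."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for index, name in enumerate(vm_names):
        placement[index % fault_domain_count].append(name)
    return dict(placement)


def surviving_vms(placement: Dict[int, List[str]], failed_domain: int) -> List[str]:
    """Return the VMs that keep running when one fault domain fails."""
    return [
        vm
        for domain, vms in placement.items()
        if domain != failed_domain
        for vm in vms
    ]


placement = assign_fault_domains(["vm0", "vm1", "vm2", "vm3"])
still_running = surviving_vms(placement, failed_domain=0)  # vm1 and vm3 remain
```

With four VMs over two domains, losing one domain still leaves half the fleet serving traffic, which is the property an Availability Set is meant to provide.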

Here are some of the regions that support Availability Zones:

  • Central US
  • East US 2
  • West US 2
  • West Europe
  • France Central
  • North Europe
  • Southeast Asia

To achieve comprehensive business continuity on Azure, build your application architecture on a combination of Availability Zones and Azure region pairs.
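The benefit of combining redundant deployments can be estimated with basic probability. The sketch below assumes that failures are independent, which is a strong assumption: a correlated failure, such as a region-wide configuration change, breaks it. The function name and the 99.99% inputs are illustrative.

```python
from typing import Iterable


def combined_availability(availabilities: Iterable[float]) -> float:
    """Availability of redundant deployments, assuming independent failures.

    The system is down only when every deployment is down at the same time,
    so the unavailabilities multiply.
    """
    unavailability = 1.0
    for availability in availabilities:
        unavailability *= 1.0 - availability
    return 1.0 - unavailability


# Two independent deployments at 99.99% each: about 99.999999% combined.
two_regions = combined_availability([0.9999, 0.9999])
```

Real-world figures will be lower than this estimate, because failover takes time and shared dependencies make failures less independent than the formula assumes.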

Disaster Recovery

Microsoft Azure's region pairing strategy is central to disaster recovery. Azure groups its regions into pairs within the same geography to support high availability and disaster recovery.

If an outage affects both regions in a pair at the same time, one of them is prioritized for recovery, which helps businesses that need to be back up and running quickly.

By using these region pairs deliberately, businesses can design more resilient architectures that better withstand regional outages.
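On the application side, using a region pair often comes down to trying a secondary region when the primary fails. The following is a minimal sketch of that pattern, not an Azure SDK feature: `call_with_failover` and the callables passed to it are hypothetical. For Central US, Microsoft's published pairing is East US 2, so those would be the natural primary and secondary in this example.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(region_calls: Sequence[Callable[[], T]]) -> T:
    """Try each regional call in order and return the first success.

    region_calls is ordered by preference, for example
    [call_central_us, call_east_us_2] (hypothetical callables).
    """
    last_error: Optional[Exception] = None
    for call in region_calls:
        try:
            return call()
        except Exception as error:   # in real code, catch narrower exceptions
            last_error = error
    raise RuntimeError("all regions failed") from last_error
```

This only helps if the secondary region already has the data and capacity it needs, so the pattern is usually combined with replication between the paired regions.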

Azure Central US Issues

The recent Microsoft outage in the Central US region was a significant incident that affected multiple Azure services, including App Service, Azure Active Directory, and Virtual Machines.

Between 21:40 UTC on July 18, 2024, and 12:15 UTC on July 19, 2024, customers experienced significant issues with these services due to an Azure configuration update that disrupted the connection between compute and storage resources.
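The length of the impact window follows directly from those two timestamps. The short sketch below does the arithmetic; the helper name `duration_hours_minutes` is an assumption.

```python
from datetime import datetime, timezone
from typing import Tuple


def duration_hours_minutes(start: datetime, end: datetime) -> Tuple[int, int]:
    """Return the elapsed time between two datetimes as (hours, minutes)."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    return hours, remainder // 60


impact_start = datetime(2024, 7, 18, 21, 40, tzinfo=timezone.utc)
impact_end = datetime(2024, 7, 19, 12, 15, tzinfo=timezone.utc)
impact = duration_hours_minutes(impact_start, impact_end)  # (14, 35)
```

The window therefore spans 14 hours and 35 minutes, although individual services and customers may have been affected for only part of it.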

Several services were impacted, including Azure Cosmos DB, Microsoft Sentinel, and SQL Database, all of which experienced failures in service management operations and connectivity or availability issues.

The affected services include:

  • App Service
  • Azure Active Directory (Microsoft Entra ID)
  • Azure Cosmos DB
  • Microsoft Sentinel
  • Azure Data Factory
  • Event Hubs
  • Service Bus
  • Log Analytics
  • SQL Database
  • SQL Managed Instance
  • Virtual Machines
  • Cognitive Services
  • Application Insights
  • Azure Resource Manager (ARM)
  • Azure NetApp Files
  • Azure Communication Services
  • Microsoft Defender
  • Azure Cache for Redis
  • Azure Database for PostgreSQL-Flexible Server
  • Azure Stream Analytics
  • Azure SignalR Service
  • App Configuration

Microsoft is working to prevent such incidents in the future by implementing several improvements across its storage, SQL, and Cosmos DB services.

What Happened?

Azure Incident Retrospective: Storage issues in Central US, July 2024 (Tracking ID: 1K80-N_8)

In April, The Futurum Group published a report analyzing cloud availability over 12 months, highlighting Azure's tendency to have outages that affect many services or regions at the same time.

The latest outage follows this pattern, with a root cause tied to excessive cross-dependencies between services in Azure's cloud architecture.

Between 21:40 UTC on July 18, 2024, and 12:15 UTC on July 19, 2024, customers experienced significant issues with multiple Azure services in the Central US region.

This disruption stemmed from an Azure configuration update that disrupted the connection between compute and storage resources.

Consequently, several Azure services reliant on these resources encountered failures in service management operations and faced connectivity or availability issues.


Preventing Future Microsoft Failures

Microsoft is taking steps to prevent future failures like the July 2024 Central US outage. It plans to implement improvements across its storage, SQL, and Cosmos DB services.

One of the key changes is fixing the 'Allow List' generation workflow so that it detects incomplete source information. This should reduce the likelihood of a repeat incident.
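Detecting incomplete source information amounts to validating inputs before generating the list. The sketch below illustrates that idea only; it is not Microsoft's workflow, and the record shape (`host`, `address`), the `expected_hosts` parameter, and the function name are all hypothetical.

```python
from typing import Iterable, List, Mapping, Optional


def build_allow_list(
    records: Iterable[Mapping[str, Optional[str]]],
    expected_hosts: int,
) -> List[str]:
    """Build an allow list, refusing to proceed when source data is incomplete."""
    addresses: List[str] = []
    missing: List[str] = []
    for record in records:
        address = record.get("address")
        if address:
            addresses.append(address)
        else:
            missing.append(record.get("host") or "<unknown>")
    # Fail closed: a partial list would silently block legitimate traffic.
    if missing or len(addresses) < expected_hosts:
        raise ValueError(
            f"incomplete source data: {len(addresses)} of {expected_hosts} "
            f"hosts have addresses; missing: {missing}"
        )
    return sorted(addresses)
```

Raising an error instead of publishing a partial list keeps the previous, complete list in effect until the source data is fixed.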

Microsoft will also improve alerting for rejected storage requests, so that problems surface sooner.

'Allow List' deployments will roll out in smaller batches with additional VM health checks, which should catch a bad change before it causes an outage.

These rollouts will also become zone-aware, and an invalid 'Allow List' deployment will revert to the last-known-good state. Both changes should limit the impact of any future failure.
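The combination of staged rollout, health checks, and reverting to a last-known-good state can be sketched in a few lines. This is an illustration of the general pattern rather than Microsoft's deployment system: `staged_rollout`, the zone names, the config shape, and the `is_healthy` callback are all assumptions.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def staged_rollout(
    zones: Dict[str, Config],
    new_config: Config,
    is_healthy: Callable[[str, Config], bool],
) -> List[str]:
    """Apply new_config one zone at a time and return the zones updated.

    On the first failed health check, the zone is reverted to its
    last-known-good config and the rollout stops, so later zones
    are never touched.
    """
    updated: List[str] = []
    for zone in sorted(zones):
        last_known_good = zones[zone]
        zones[zone] = new_config
        if not is_healthy(zone, new_config):
            zones[zone] = last_known_good   # revert and halt the rollout
            break
        updated.append(zone)
    return updated
```

Updating one zone at a time means a bad configuration is contained to a single zone, while the remaining zones keep serving traffic on the old configuration.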

The SQL and Cosmos DB teams are also improving resilience to storage incidents. SQL is improving the Service Fabric cluster location change notification mechanism and implementing a zone-redundant setup for the metadata store.

Cosmos DB is addressing failover issues by adding automatic per-partition failover for active-passive accounts. This should help keep customers' data available during a storage incident.

Microsoft plans to complete these changes progressively, with some extending into 2025.

Walter Brekke

Lead Writer

Walter Brekke is a seasoned writer with a passion for creating informative and engaging content. With a strong background in technology, Walter has established himself as a go-to expert in the field of cloud storage and collaboration. His articles have been widely read and respected, providing valuable insights and solutions to readers.
