Understanding the Causes of Azure Outage and Its Impact

Azure outages can have a significant impact on businesses, causing lost revenue and reputational damage.

According to a study, 60% of companies reported a significant loss of revenue due to a cloud outage.

Azure outages can be caused by a variety of factors, including network connectivity issues.

A network connectivity issue can occur when there is a problem with the underlying infrastructure, such as a faulty router or a misconfigured firewall.

This can cause delays in data transfer and ultimately lead to an outage.

Azure outages can also be caused by software issues, such as bugs or glitches in the Azure platform.

A software issue can occur when there is a problem with the code or configuration of the Azure platform.

This can cause unexpected behavior and errors, leading to an outage.

In addition, Azure outages can be caused by human error.

Human error occurs when a user or administrator accidentally deletes or modifies a critical resource, causing an outage.

This can be a costly mistake, especially if it affects multiple customers.

Azure outages can also be caused by external factors, such as natural disasters or cyber attacks.

A natural disaster can cause physical damage to the data center, leading to an outage.

A cyber attack, such as a distributed denial-of-service (DDoS) flood, can overwhelm or compromise the Azure platform, causing an outage.

What Caused the Outage?

The recent Azure outage was caused by a configuration change, according to Microsoft. This change was made around 18:22 UTC and impacted services that leverage Azure Front Door (AFD), its modern cloud Content Delivery Network (CDN).

The company rolled back the change and started seeing recovery from 19:25 UTC, with many Microsoft services failing away from AFD (shifting their traffic onto paths that bypass it) in response to the issue.

However, customers also reported experiencing errors connecting to Azure services, including Azure DevOps, in the United Kingdom and Brazil.
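The "failing away" behavior described above can be illustrated with a minimal sketch: try a list of endpoints in order of preference and use the first one whose health probe succeeds. The hostnames and the `fake_probe` function are hypothetical placeholders, not real Azure endpoints or Microsoft's actual failover logic; a real probe would issue an HTTP request with a timeout.

```python
from typing import Callable, Optional, Sequence


def first_healthy(endpoints: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose health probe succeeds, or None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # Treat a probe that raises (timeout, DNS failure) as unhealthy.
            continue
    return None


# Hypothetical hostnames, ordered by preference: CDN edge first, origin last.
ENDPOINTS = ["edge.example.com", "backup-edge.example.com", "origin.example.com"]


def fake_probe(endpoint: str) -> bool:
    # Stand-in for a real HTTP health check; simulates the primary edge being down.
    return endpoint != "edge.example.com"


chosen = first_healthy(ENDPOINTS, fake_probe)
print(chosen)  # prints "backup-edge.example.com"
```

Passing the probe in as a function keeps the selection logic testable without network access.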

The Azure status page didn't show any information about services being affected for at least an hour, and it also failed to load for many customers during the outage.

A distributed denial-of-service (DDoS) attack was the trigger event for a previous Azure outage that lasted roughly eight hours. An error in the implementation of Microsoft's DDoS protection mechanisms amplified the impact of the attack rather than mitigating it.

The attack was a volumetric TCP SYN flood that targeted multiple Azure Front Door and CDN sites.
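A SYN flood works by sending connection-opening SYN packets far faster than a server can complete handshakes. As a rough illustration of how such a surge might be spotted, the sketch below counts SYNs inside a sliding time window and flags when the count exceeds a threshold. The class name, window size, and threshold are assumptions chosen for the example; this is not a description of Microsoft's DDoS protection.

```python
from collections import deque


class SynRateMonitor:
    """Flags a possible SYN flood when too many SYNs arrive within a window."""

    def __init__(self, window_seconds: float, max_syns: int) -> None:
        self.window_seconds = window_seconds
        self.max_syns = max_syns
        self._timestamps = deque()

    def record_syn(self, now: float) -> bool:
        """Record one SYN at time `now`; return True if the limit is exceeded."""
        self._timestamps.append(now)
        # Drop SYNs that have aged out of the sliding window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_syns


# Illustrative values: allow at most 3 SYNs per second.
monitor = SynRateMonitor(window_seconds=1.0, max_syns=3)
flags = [monitor.record_syn(t) for t in (0.0, 0.1, 0.2, 0.3, 5.0)]
print(flags)  # [False, False, False, True, False]
```

Production defenses operate at far larger scale and typically combine rate tracking with techniques such as SYN cookies, so treat this only as a way to picture the traffic pattern.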

Microsoft has experienced several outages in the past, including a faulty Enterprise Configuration Service (ECS) deployment in July 2022 and a Wide Area Network IP change in January 2023.

In June 2023, the company's Azure, Outlook, and OneDrive web portals were taken down in Layer 7 DDoS attacks by a threat actor tracked as Anonymous Sudan (aka Storm-1359), believed to have Russian ties.

Background and Context

Microsoft Azure has experienced a string of outages in recent years, with the company attributing some to configuration changes and others to distributed denial-of-service (DDoS) attacks.

In the past, Azure outages have had a significant impact on multiple services and regions, with the Central US region being one of the most affected areas.

Some of the services that have been impacted by Azure outages include Azure Active Directory, Azure Cosmos DB, Microsoft Sentinel, and SQL Database.

Microsoft has acknowledged that its cloud architecture has a concentration of risk due to excessive cross-dependencies between services, which can result in frequent outages with a large blast radius.
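The idea of a blast radius can be made concrete with a small sketch: given a map of which services depend on which, walk the graph to find everything that is directly or transitively affected when one service fails. The dependency map below is entirely hypothetical and does not reflect Azure's real architecture.

```python
from collections import deque

# Hypothetical map: service -> services it depends on. Illustrative only.
DEPENDS_ON = {
    "storage": [],
    "identity": [],
    "virtual_machines": ["storage"],
    "sql_database": ["virtual_machines", "storage"],
    "app_service": ["virtual_machines", "identity"],
    "portal": ["app_service", "identity"],
}


def blast_radius(failed: str, depends_on: dict) -> set:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each service, record who depends on it.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius("storage", DEPENDS_ON)))
# ['app_service', 'portal', 'sql_database', 'virtual_machines']
```

In this toy graph a storage failure reaches four other services, while a failure of `portal` reaches none, which is why heavily shared foundations carry the most concentrated risk.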

Cloud outages can happen to anyone, regardless of their size or reputation. In fact, a recent outage on Azure and Microsoft services lasted nearly eight hours.

The outage occurred between 11:45 and 19:43 UTC on July 30, 2024, affecting a subset of customers globally. It was triggered by a Distributed Denial-of-Service (DDoS) attack, which was amplified by an error in the implementation of Microsoft's defenses.
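The length of the incident window follows directly from the reported start and end times. A quick check, using only the timestamps given above:

```python
from datetime import datetime, timezone

# Reported incident window on July 30, 2024 (UTC).
start = datetime(2024, 7, 30, 11, 45, tzinfo=timezone.utc)
end = datetime(2024, 7, 30, 19, 43, tzinfo=timezone.utc)

duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"{hours} h {minutes} min")  # prints "7 h 58 min"
```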

Microsoft's mitigation efforts involved implementing networking configuration changes and performing failovers to alternate networking paths. This helped to mitigate the impact of the attack, but some customers still experienced less than 100% availability.

The outage highlights the ease with which DDoS actors can cause significant disruptions to critical business services. As Donny Chong, Director at Nexusguard, noted, "Anyone can carry out an attack of this magnitude from their own bedroom if they have the right equipment."

Azure tends to have outages that affect many services or regions at the same time, due to excessive cross-dependencies between services. This was the case in a separate recent outage, which impacted multiple Azure services in the Central US region.

The root cause of the Central US outage was an Azure configuration update that disrupted the connection between compute and storage resources. This critically affected the availability of Virtual Machines (VMs), leading to failures in service management operations and connectivity or availability issues.

The outage also affected downstream services, including Microsoft 365 email, which was not listed in Microsoft's initial report.

Here are some of the services that were impacted by the Central US outage:

  • App Service
  • Azure Active Directory (Microsoft Entra ID)
  • Azure Cosmos DB
  • Microsoft Sentinel
  • Azure Data Factory
  • Event Hubs
  • Service Bus
  • Log Analytics
  • SQL Database
  • SQL Managed Instance
  • Virtual Machines
  • Cognitive Services
  • Application Insights
  • Azure Resource Manager (ARM)
  • Azure NetApp Files
  • Azure Communication Services
  • Microsoft Defender
  • Azure Cache for Redis
  • Azure Database for PostgreSQL-Flexible Server
  • Azure Stream Analytics
  • Azure SignalR Service
  • App Configuration

Not the First Time

This is not the first time Microsoft has had problems with Azure AD (now Microsoft Entra ID). The company is working on making Azure AD more fail-safe, which is a welcome step.

The fact that both major incidents belong to the same class of risks suggests that similar problems might happen again.

Frequently Asked Questions

Did CrowdStrike cause the Azure outage?

No, CrowdStrike did not cause the Azure outage. The two incidents were unrelated, despite occurring close in time and affecting Microsoft systems.

Glen Hackett

Writer

Glen Hackett is a skilled writer with a passion for crafting informative and engaging content. With a keen eye for detail and a knack for breaking down complex topics, Glen has established himself as a trusted voice in the tech industry. His writing expertise spans a range of subjects, including Azure Certifications, where he has developed a comprehensive understanding of the platform and its various applications.
