Major Azure Outage: Causes and Next Steps for Recovery

On June 7, 2023, Microsoft's Azure cloud platform experienced a major outage that left many users unable to access their services.

The outage was caused by a configuration issue in the Azure Active Directory (Azure AD) service.

This issue affected a wide range of Azure services, including Azure Storage, Azure SQL Database, and Azure Active Directory itself.

The outage lasted for several hours, with some users reporting issues as late as 10pm EST.

Microsoft's Azure team worked quickly to resolve the issue, but the outage still had significant impacts on users and businesses that rely on the platform.

What Happened?

The Futurum Group analyzed a major Azure outage that occurred in April this year. Its report highlighted Azure's tendency to have outages that affect many services or regions at the same time.

The root cause of the outage goes back to the fundamentals of cloud architecture. According to the report, Azure has a concentration of risk due to excessive cross-dependencies between services, resulting in frequent outages with a large blast radius.
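
To make the idea of a blast radius concrete, here is a minimal sketch that walks a service dependency map and collects everything that would be affected when one service fails. The service names and edges are hypothetical and chosen only for illustration; they do not describe Azure's actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it depends on.
# These edges are illustrative only and do not describe Azure's real architecture.
DEPENDS_ON = {
    "identity": [],
    "storage": ["identity"],
    "sql": ["storage", "identity"],
    "app_service": ["sql", "storage"],
    "cdn": [],
}


def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk over dependents, starting from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

In this toy map, a failure in `identity` reaches three other services, while a failure in `cdn` reaches none. The more cross-dependencies a platform has, the larger the set this kind of walk returns.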

Between July 18 and 19, 2024, customers experienced significant issues with multiple Azure services in the Central US region. The incident was caused by an Azure configuration update that interrupted the connection between compute and storage resources.

The affected services included, but were not limited to:

App Service
Azure Active Directory (Microsoft Entra ID)
Azure Cosmos DB
Microsoft Sentinel
Azure Data Factory
Event Hubs
Service Bus
Log Analytics
SQL Database
SQL Managed Instance
Virtual Machines
Cognitive Services
Application Insights
Azure Resource Manager (ARM)
Azure NetApp Files
Azure Communication Services
Microsoft Defender
Azure Cache for Redis
Azure Database for PostgreSQL-Flexible Server
Azure Stream Analytics
Azure SignalR Service
App Configuration

These services experienced both failures in service management operations and connectivity or availability issues during the incident.

Impact and Issues

An unexpected usage spike caused Azure Front Door (AFD) components to perform below acceptable thresholds, which led to intermittent errors, timeouts, and latency spikes.

Microsoft took immediate action to fix the issue, implementing network configuration changes and performing failovers to provide alternate network paths for relief.

These changes successfully mitigated the impacts of the usage spike, but caused some side effects to certain services.

The company is now updating its mitigation approach to minimize these side effects, applying Safe Deployment Practices starting in Asia Pacific regions and expanding in phases.

This means that some services are still experiencing intermittent errors, but Microsoft is working to resolve the issue as quickly as possible.
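
On the customer side, intermittent errors and timeouts like these are commonly handled by retrying with exponential backoff and jitter. The sketch below is a generic illustration, not an Azure SDK feature: `TransientError`, `call_with_retries`, and the default delays are all assumptions made for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with capped backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the failure.
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

The random jitter spreads retries out over time, so many clients recovering at once are less likely to create a second spike against a service that is still under load.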

Microsoft's Response

Microsoft quickly acknowledged the outage, stating that it was caused by a "network configuration issue" that affected multiple Azure regions.

The company's engineers worked around the clock to resolve the issue, but not before many users reported experiencing errors and downtime.

Microsoft's Azure Status page was updated frequently to keep customers informed about the status of the outage.

The company apologized for the inconvenience caused by the outage and promised to conduct a thorough investigation to prevent similar issues in the future.

Microsoft's engineers were able to deploy a fix to resolve the issue, but not before it had been ongoing for several hours.

The company's swift response and communication helped to mitigate the impact of the outage and maintain customer trust.

Microsoft 365 Outage

Microsoft is investigating an ongoing global outage blocking access to some Microsoft 365 and Azure services.

The outage started roughly one hour ago and has affected users worldwide, with many reporting issues connecting to Microsoft 365 websites and Outlook.

Hundreds of reports have been received by Downdetector, with affected users saying Entra, Intune, and Power Apps are down.

The affected services include the Microsoft 365 admin center, Intune, Entra, Power BI, and Power Platform services, while SharePoint Online, OneDrive for Business, Microsoft Teams, and Exchange Online are not affected.

Users who can access the impacted Microsoft 365 services may experience latency or degraded feature performance.

Microsoft has implemented a networking configuration change, and some Microsoft 365 services have performed failovers to alternate networking paths to provide relief.

The outage was caused by an unexpected usage spike that resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds.

Microsoft is updating its mitigation approach to minimize side effects and applying safe deployment practices, starting in Asia Pacific regions and then expanding in phases.

The company says the vast majority of customers and services are fully mitigated, and its engineers are in the final stages of validating recovery.

Looking Ahead

As we move forward, it's clear that the impact of outages like the major Azure outage will only continue to grow. The ripple effects can disrupt business operations, affect customer experiences, and even impact the broader economy.

Regulatory scrutiny is intensifying, particularly in regions like Europe where the Digital Operational Resilience Act (DORA) is set to enforce stricter standards. DORA aims to enhance the resilience of digital services by imposing requirements on financial entities to ensure they can withstand, respond to, and recover from all types of ICT-related disruptions and threats.

Companies like Microsoft must prioritize enhancing the resilience and reliability of cloud services. This includes investing in advanced monitoring and mitigation technologies, adopting best practices for configuration management, and ensuring that robust disaster recovery and business continuity plans are in place.

The importance of transparency and communication during outages cannot be overstated. Clear and timely communication with customers is critical to managing the situation effectively and maintaining trust.

Microsoft's detailed post-incident report and the steps it took to address the issue set a standard for how such situations should be handled.

Preparedness and Planning

Microsoft is planning to implement several improvements across its storage, SQL, and Cosmos DB services to reduce the likelihood and impact of incidents like the recent outage.

These improvements include fixing the 'Allow List' generation workflow to detect incomplete source information and improving alerting for rejected storage requests. This will help prevent similar issues in the future.
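
As a rough illustration of what detecting incomplete source information could look like, the sketch below refuses to generate an allow list when any input record lacks an address or when fewer records arrive than expected. The record shape, the `expected_count` check, and all names are hypothetical; Microsoft has not published its workflow in this form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRecord:
    # Hypothetical shape of one inventory entry feeding the allow list.
    host_id: str
    address: Optional[str]  # None models missing source information


class IncompleteSourceError(Exception):
    """Raised instead of emitting an allow list built from partial data."""


def build_allow_list(records, expected_count):
    """Return a sorted, de-duplicated allow list, or fail loudly."""
    missing = [r.host_id for r in records if not r.address]
    if missing:
        raise IncompleteSourceError(f"no address for: {', '.join(missing)}")
    if len(records) < expected_count:
        raise IncompleteSourceError(
            f"expected {expected_count} records, got {len(records)}"
        )
    return sorted({r.address for r in records})
```

The design choice here is to fail closed at generation time: an error in the pipeline is far cheaper than deploying a list that silently omits legitimate hosts.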

Microsoft is also reducing batch sizes and adding additional VM health checks during 'Allow List' deployments. This will help identify and resolve problems before they cause an outage.

The company is planning a zone-aware rollout for these deployments, which will ensure that invalid 'Allow List' deployments revert to the last-known-good state.
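
The combination of small batches, health checks, zone-by-zone ordering, and reverting to a last-known-good state can be sketched as follows. This is a simplified model under stated assumptions: the `zones` and `deployed` dictionaries, the `is_healthy` hook, and the function name are invented for the example and are not Microsoft's deployment tooling.

```python
def staged_rollout(zones, deployed, new_config, is_healthy, batch_size=2):
    """Roll `new_config` out one zone at a time, in small batches.

    zones: dict mapping zone name -> list of node names (hypothetical layout).
    deployed: dict mapping node name -> current config; mutated in place.
    is_healthy: caller-supplied check taking (node, config) -> bool.
    Returns True if every batch passed, False if the rollout was reverted.
    """
    last_known_good = dict(deployed)  # snapshot taken before any change
    touched = []
    for zone, nodes in zones.items():
        for start in range(0, len(nodes), batch_size):
            batch = nodes[start:start + batch_size]
            for node in batch:
                deployed[node] = new_config
                touched.append(node)
            # Stop at the first unhealthy batch and restore every node changed
            # so far; later batches and zones are never touched.
            if not all(is_healthy(node, deployed[node]) for node in batch):
                for node in touched:
                    deployed[node] = last_known_good[node]
                return False
    return True
```

Because each zone is finished before the next begins, a bad configuration caught in the first batch never reaches the remaining zones.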

Microsoft is working on adopting the Resilient Ephemeral OS disk improvement to enhance VM resilience to storage incidents in SQL and Cosmos DB services.

This change will help prevent data loss and minimize downtime in the event of a storage incident.

For Cosmos DB, Microsoft plans to address failover issues by adding automatic per-partition failover for active-passive accounts.

Frequently Asked Questions

What was the Azure outage of July 19, 2024?

Microsoft Azure experienced a significant outage in July 2024, affecting multiple services and regions, but service was restored within hours. This incident highlights the importance of cloud infrastructure for modern businesses.

Calvin Connelly

Senior Writer

Calvin Connelly is a seasoned writer with a passion for crafting engaging content on a wide range of topics. With a keen eye for detail and a knack for storytelling, Calvin has established himself as a versatile and reliable voice in the world of writing. In addition to his general writing expertise, Calvin has developed a particular interest in covering important and timely subjects that impact society.
