
On June 7, 2023, Microsoft's Azure cloud platform experienced a major outage that left many users unable to access their services.
The outage was caused by a configuration issue in the Azure Active Directory (Azure AD) service.
This issue affected a wide range of Azure services, including Azure Storage, Azure SQL Database, and Azure Active Directory itself.
The outage lasted for several hours, with some users reporting issues as late as 10pm EST.
Microsoft's Azure team worked quickly to resolve the issue, but the outage still had significant impacts on users and businesses that rely on the platform.
What Happened?
A major Azure outage occurred in April this year and was analyzed by The Futurum Group. Its report highlighted Azure's tendency to have outages that affect many services or regions at the same time.
The root cause of the outage goes back to the fundamentals of cloud architecture. According to the report, Azure has a concentration of risk due to excessive cross-dependencies between services, resulting in frequent outages with a large blast radius.
Between July 18 and 19, 2024, customers experienced significant issues with multiple Azure services in the Central US region. The incident was caused by an Azure configuration update that broke the connection between compute and storage resources.
The affected services included, but were not limited to:
- App Service
- Azure Active Directory (Microsoft Entra ID)
- Azure Cosmos DB
- Microsoft Sentinel
- Azure Data Factory
- Event Hubs
- Service Bus
- Log Analytics
- SQL Database
- SQL Managed Instance
- Virtual Machines
- Cognitive Services
- Application Insights
- Azure Resource Manager (ARM)
- Azure NetApp Files
- Azure Communication Services
- Microsoft Defender
- Azure Cache for Redis
- Azure Database for PostgreSQL - Flexible Server
- Azure Stream Analytics
- Azure SignalR Service
- App Configuration
These services experienced both failures in service management operations and connectivity or availability issues during the incident.
Impact and Issues
An unexpected usage spike on Azure Front Door (AFD) components led to intermittent errors, timeouts, and latency spikes.
Microsoft took immediate action to fix the issue, implementing network configuration changes and performing failovers to provide alternate network paths for relief.
These changes successfully mitigated the impacts of the usage spike, but caused some side effects to certain services.
The company then updated its mitigation approach to minimize these side effects, applying Safe Deployment Practices starting in Asia Pacific regions and expanding in phases.
While that phased rollout was in progress, some services continued to experience intermittent errors.
Microsoft's Response
Microsoft quickly acknowledged the outage, stating that it was caused by a "network configuration issue" that affected multiple Azure regions.
The company's engineers worked around the clock to resolve the issue, but not before many users reported experiencing errors and downtime.
Microsoft's Azure Status page was updated frequently to keep customers informed about the status of the outage.
The company apologized for the inconvenience caused by the outage and promised to conduct a thorough investigation to prevent similar issues in the future.
A fix was eventually deployed, after the outage had been ongoing for several hours.
The company's swift response and communication helped to mitigate the impact of the outage and maintain customer trust.
Microsoft 365 Outage
Microsoft investigated a global outage that blocked access to some Microsoft 365 and Azure services.
Within roughly an hour of its start, the outage had affected users worldwide, with many reporting issues connecting to Microsoft 365 websites and Outlook.
Downdetector received hundreds of reports, with affected users saying Entra, Intune, and Power Apps were down.
The affected services included the Microsoft 365 admin center, Intune, Entra, Power BI, and Power Platform services, while SharePoint Online, OneDrive for Business, Microsoft Teams, and Exchange Online were not affected.
Users who could still access the impacted Microsoft 365 services may have experienced latency or degraded feature performance.
Microsoft implemented a networking configuration change, and some Microsoft 365 services performed failovers to alternate networking paths to provide relief.
The outage was caused by an unexpected usage spike that resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds.
Microsoft then updated its mitigation approach to minimize side effects, applying safe deployment practices starting in Asia Pacific regions and then expanding in phases.
The company later said the vast majority of customers and services were fully mitigated and that its engineers were in the final stages of validating recovery.
Looking Ahead
As we move forward, it's clear that the impact of major cloud outages like these will only continue to grow. The ripple effects can disrupt business operations, affect customer experiences, and even impact the broader economy.
Regulatory scrutiny is intensifying, particularly in regions like Europe where the Digital Operational Resilience Act (DORA) is set to enforce stricter standards. DORA aims to enhance the resilience of digital services by imposing requirements on financial entities to ensure they can withstand, respond to, and recover from all types of ICT-related disruptions and threats.
Companies like Microsoft must prioritize enhancing the resilience and reliability of cloud services. This includes investing in advanced monitoring and mitigation technologies, adopting best practices for configuration management, and ensuring that robust disaster recovery and business continuity plans are in place.
The importance of transparency and communication during outages cannot be overstated. Clear and timely communication with customers is critical to managing the situation effectively and maintaining trust.
Microsoft's detailed post-incident report and the steps it took to address the issue set a standard for how such situations should be handled.
Preparedness and Planning
Microsoft is planning to implement several improvements across its storage, SQL, and Cosmos DB services to reduce the likelihood and impact of incidents like the recent outage.
These improvements include fixing the 'Allow List' generation workflow to detect incomplete source information and improving alerting for rejected storage requests. This will help prevent similar issues in the future.
Microsoft is also reducing batch sizes and adding additional VM health checks during 'Allow List' deployments. This will help identify and resolve problems before they cause an outage.
The company is planning a zone-aware rollout for these deployments, which will ensure that invalid 'Allow List' deployments revert to the last-known-good state.
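The safeguards described above (rejecting an 'Allow List' built from incomplete sources, deploying in small batches with health checks, and reverting to the last-known-good state) can be illustrated with a short Python sketch. This is not Microsoft's implementation: `build_allow_list`, `rollout`, `Node`, and the `is_healthy` callback are hypothetical names, and the addresses and zone names in the usage lines are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


class IncompleteSourceError(ValueError):
    """Raised when a source contributed no addresses to the allow list."""


def build_allow_list(sources: Dict[str, Set[str]]) -> Set[str]:
    """Merge the address sets of all sources; refuse to continue if any is empty."""
    missing = sorted(name for name, addrs in sources.items() if not addrs)
    if missing:
        raise IncompleteSourceError("no addresses from: " + ", ".join(missing))
    merged: Set[str] = set()
    for addrs in sources.values():
        merged |= addrs
    return merged


@dataclass
class Node:
    """A hypothetical storage node that enforces an allow list."""
    name: str
    zone: str
    allow_list: Set[str] = field(default_factory=set)


def rollout(
    nodes: List[Node],
    new_list: Set[str],
    is_healthy: Callable[[Node], bool],
    batch_size: int = 2,
) -> bool:
    """Apply new_list one zone at a time in small batches.

    If any batch fails its health check, every node touched so far is
    reverted to its last-known-good list and the rollout stops.
    """
    last_known_good = {n.name: set(n.allow_list) for n in nodes}
    touched: List[Node] = []
    for zone in sorted({n.zone for n in nodes}):
        zone_nodes = [n for n in nodes if n.zone == zone]
        for start in range(0, len(zone_nodes), batch_size):
            batch = zone_nodes[start:start + batch_size]
            for node in batch:
                node.allow_list = set(new_list)
                touched.append(node)
            if not all(is_healthy(node) for node in batch):
                for node in touched:
                    node.allow_list = last_known_good[node.name]
                return False
    return True


# Usage: the health check stands in for "existing VMs can still reach storage".
nodes = [Node("s1", "zone-1", {"10.0.0.1"}), Node("s2", "zone-2", {"10.0.0.1"})]
candidate = build_allow_list({"fleet-a": {"10.0.0.1"}, "fleet-b": {"10.0.0.2"}})
succeeded = rollout(nodes, candidate, lambda n: "10.0.0.1" in n.allow_list)
```

Keeping batches small and checking health after each one limits how many nodes can receive a bad list before the rollout stops, which is the blast-radius concern raised earlier in the article.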
Microsoft is working on adopting the Resilient Ephemeral OS disk improvement to enhance VM resilience to storage incidents in SQL and Cosmos DB services.
This change will help prevent data loss and minimize downtime in the event of a storage incident.
The Cosmos DB team plans to address failover issues by adding automatic per-partition failover for active-passive accounts.
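To make the idea of per-partition failover concrete, here is a minimal Python sketch in which each partition moves to another region on its own, instead of the whole account failing over at once. It is only an illustration of the concept and not how Cosmos DB works internally: the `fail_over` function, the `is_available` callback, the partition IDs, and the `eastus2` region name are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def fail_over(
    active: Dict[str, str],
    region_priority: List[str],
    is_available: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Move only the partitions whose active region is unavailable.

    active maps partition ID to its current active region and is updated
    in place. Returns the partitions that moved and their new regions.
    """
    moved: Dict[str, str] = {}
    for pid, region in active.items():
        if is_available(pid, region):
            continue  # healthy partitions stay where they are
        for candidate in region_priority:
            if candidate != region and is_available(pid, candidate):
                moved[pid] = candidate
                break
    active.update(moved)
    return moved


# Usage: only partition p1 has lost its copy in centralus.
down = {("p1", "centralus")}
active = {"p0": "centralus", "p1": "centralus"}
moved = fail_over(
    active,
    ["centralus", "eastus2"],
    lambda pid, region: (pid, region) not in down,
)
```

In this sketch a partition with no available alternative simply stays put, so a real system would also need alerting for that case.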
Frequently Asked Questions
What was the Azure outage of July 19, 2024?
Between July 18 and 19, 2024, Microsoft Azure experienced a significant outage that affected multiple services in the Central US region. The incident highlights how heavily modern businesses depend on cloud infrastructure.
Sources
- https://www.theregister.com/2024/09/12/att_microsoft_365_outage/
- https://www.theregister.com/2024/07/30/microsofts_azure_portal_outage/
- https://futurumgroup.com/insights/microsofts-central-us-azure-outage-what-went-wrong/
- https://www.bleepingcomputer.com/news/microsoft/microsoft-365-and-azure-outage-takes-down-multiple-services/
- https://www.windowscentral.com/microsoft/microsoft-yesterdays-azure-and-365-server-outage-was-caused-by-a-ddos-attack