Azure Chaos Studio is a managed service that makes it easy to inject failures into your applications, simulating real-world scenarios to test their resilience.
This allows you to identify and fix issues before they cause downtime or data loss.
To get started, you register the Microsoft.Chaos resource provider in your subscription and onboard the resources you want to target in the Azure portal.
Azure Chaos Studio supports a wide range of fault types, including network, disk, and process failures.
Experimenting with Azure Chaos Studio
You can validate product quality on your own terms with Azure Chaos Studio, using a hypothesis-based approach to drive application resilience. This can mean integrating chaos into your CI/CD pipeline, running drill events or game days, or a combination of these.
Azure Chaos Studio lets you create experiments from a continuously growing fault library, which you can use to validate your architecture, configuration, code, monitoring, and staffing and resources. Experiments are billed per action-minute, that is, for the duration that your experiment actions run.
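As a rough illustration of per-action-minute billing, the sketch below adds up the run time of each action in an experiment and multiplies it by a rate. The rate is a hypothetical placeholder supplied by the caller, and the sketch assumes no rounding of partial minutes; check the Azure pricing page for the actual terms.

```python
from datetime import timedelta


def action_minutes(action_durations):
    """Sum the run time of every action, expressed in minutes."""
    total = sum(action_durations, timedelta())
    return total.total_seconds() / 60


def estimated_cost(action_durations, rate_per_action_minute):
    """Multiply action-minutes by a caller-supplied (hypothetical) rate."""
    return action_minutes(action_durations) * rate_per_action_minute


# Three actions: 10 minutes, 5 minutes, and 90 seconds.
durations = [timedelta(minutes=10), timedelta(minutes=5), timedelta(seconds=90)]
print(action_minutes(durations))  # 16.5
```

Two actions that run in parallel each count toward the total here, because the charge is described per action rather than per experiment.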
To create an experiment against an Azure VM, install the Chaos Agent on the VM so that the VM can serve as the experiment's target. From the Chaos Studio blade, select Experiments, then New Experiment, and fill in the base information: Subscription, Resource Group, and Name.
You can assign a new managed identity to the experiment and grant it the IAM/RBAC permissions it needs on the target resource. For an Azure VM, Reader permissions are sufficient. You can then pick faults from the fault library, such as CPU Pressure, to inject into the target resource.
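The portal steps above produce an experiment definition behind the scenes. The sketch below builds a comparable definition as a Python dictionary for a single CPU Pressure action. The field names, the fault URN, the `pressureLevel` parameter, and the `Microsoft-Agent` target type reflect my understanding of the Microsoft.Chaos experiment schema and should be checked against the current REST reference; the resource ID and location are hypothetical.

```python
import json


def cpu_pressure_experiment(vm_resource_id, location, pressure_level=90, duration="PT10M"):
    """Build an experiment body with one agent-based CPU pressure action.

    Assumption: schema fields mirror the Microsoft.Chaos experiment resource.
    """
    selector_id = "Selector1"
    target_id = f"{vm_resource_id}/providers/Microsoft.Chaos/targets/Microsoft-Agent"
    return {
        "location": location,
        # A system-assigned identity; it still needs RBAC on the target.
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "selectors": [
                {
                    "id": selector_id,
                    "type": "List",
                    "targets": [{"type": "ChaosTarget", "id": target_id}],
                }
            ],
            "steps": [
                {
                    "name": "Step 1",
                    "branches": [
                        {
                            "name": "Branch 1",
                            "actions": [
                                {
                                    "type": "continuous",
                                    "name": "urn:csci:microsoft:agent:cpuPressure/1.0",
                                    "selectorId": selector_id,
                                    "duration": duration,  # ISO 8601 duration
                                    "parameters": [
                                        {"key": "pressureLevel", "value": str(pressure_level)}
                                    ],
                                }
                            ],
                        }
                    ],
                }
            ],
        },
    }


# Hypothetical VM resource ID, for illustration only.
vm_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/demo-rg/providers/Microsoft.Compute/virtualMachines/demo-vm"
)
experiment = cpu_pressure_experiment(vm_id, location="westeurope")
print(json.dumps(experiment, indent=2))
```

Keeping the definition as data makes it easy to store in source control and to review the fault parameters before an experiment runs.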
Here are some of the resources you can target with Azure Chaos Studio experiments:
- Virtual Machines and Virtual Machine Scale Sets
- Cosmos DB
- Azure Cache for Redis
- Key Vault and App Service
- Azure Functions
With Azure Chaos Studio, you can simulate outages and analyze how your application handles disruptive events. The results can also point to improvements, such as the need for a multi-region deployment.
Benefits and Reliability
Subjecting your Azure applications to real or simulated faults can help you understand how they respond to real-world disruptions.
You can experiment with real or simulated faults in a controlled manner to better understand application resilience. This can include network latency, unexpected storage outages, expiring secrets, or even a full datacenter outage.
By intentionally disrupting your apps, you can identify gaps and plan mitigations before your customers are impacted by a problem. Integrating load testing into your chaos experiments adds simulated customer traffic, so the faults hit the application under realistic load.
Benefits
Subjecting your Azure applications to real or simulated faults can help you identify and fix potential issues before they become major problems.
You can use Azure Chaos Studio to observe how your applications respond to real-world disruptions, giving you valuable insights into their resilience.
You can integrate chaos engineering experiments into any phase of quality validation, which lets you test and refine your applications in a controlled environment.
Microsoft engineers use the same tools as you to build resilience into cloud services, making it easier to learn from their expertise and apply it to your own projects.
By simulating faults and observing how your applications respond, you can build more robust and reliable cloud services that can withstand real-world disruptions.
Reliability of Applications
You can improve the reliability of your Azure applications by experimenting with real or simulated faults. This allows you to better understand how your applications respond to real-world disruptions.
Subjecting your Azure apps to controlled faults helps you identify areas of weakness and plan mitigations. This is a key part of building resilience into your cloud services.
To expose resiliency issues in your application, you can simulate a fault. For example, in an AKS deployment you can delete the API's pod. The API then restarts and tries to connect to a key vault that is not accessible.
You can then confirm the failure by opening the application's URL and clicking any product category. The product category page should fail to load.
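The manual check described above can also be scripted, so that the same probe runs before, during, and after an experiment. The following is a minimal sketch using only the Python standard library. The function name and the `opener` parameter are illustrative choices, and any URL you pass in is your own application's, not something defined by Chaos Studio.

```python
from urllib.request import urlopen


def page_is_healthy(url, opener=urlopen, timeout=5):
    """Return True if the page answers with a 2xx status, False otherwise.

    URLError, HTTPError, and socket timeouts all derive from OSError,
    so a single except clause covers refused connections, 4xx/5xx
    responses, and timeouts.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        return False
```

Calling `page_is_healthy` with the product category URL should return `True` before the fault is injected and `False` while the API cannot reach the key vault. The `opener` argument exists so the probe can be exercised with a stub instead of a live endpoint.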
Here are some common scenarios that can disrupt your applications:
- Network latency
- Unexpected storage outage
- Expiring secrets
- Full datacenter outage
To improve application reliability, you can integrate load testing into your chaos experiments. This simulates real-world customer traffic and helps you identify gaps in your application's resilience.
Fault Injection and Validation
Fault injection is a powerful tool for improving application reliability, but it's not enough on its own. Integrate load testing into your chaos experiments to simulate real-world customer traffic.
You can experiment with your Azure apps by subjecting them to real or simulated faults in a controlled manner. This will help you better understand application resilience.
Disrupting your apps intentionally can help you identify gaps and plan mitigations before your customers are impacted by a problem. Chaos engineering and testing can simulate real-world disruptions like network latency or an unexpected storage outage.
Observing how your apps respond to these disruptions is crucial for understanding their resilience. Chaos experiments can also help you observe how your apps will respond to a full datacenter outage.
CPU Pressure and Performance
CPU pressure can be a major issue, causing applications to crash even when the system was running smoothly beforehand. A CPU spike can have various causes, such as latency in database operations or a network connectivity issue.
Simulating a CPU spike in a production environment can be incredibly valuable for identifying and mitigating potential problems. This is exactly what Chaos Engineering can help with.
A CPU spike can have side effects, like causing retry operations that further spike CPU. It's essential to capture all possible side effects when simulating a CPU spike.
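To get a feel for how retries amplify load, the sketch below computes the expected number of attempts per request under a simplified model: every attempt fails independently with the same probability, and each failure triggers an immediate retry up to a fixed limit. Both the model and the function name are assumptions made for illustration; real retry behavior depends on backoff, timeouts, and how failures correlate.

```python
def expected_attempts(failure_probability, max_retries):
    """Expected attempts per request when each failure triggers a retry.

    Attempt k+1 happens only if the first k attempts all failed, which
    has probability failure_probability ** k.
    """
    return sum(failure_probability ** k for k in range(max_retries + 1))


# With half of all attempts failing and up to 3 retries, each request
# generates 1 + 0.5 + 0.25 + 0.125 attempts on average.
print(expected_attempts(0.5, 3))  # 1.875
```

Under this model, a service that is already failing half of its calls receives almost twice its normal traffic, which is the kind of feedback loop a CPU pressure experiment can surface.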
Chaos Engineering is an engineering discipline that involves understanding the interactions across different systems, components, and workloads. It's not just about simulating outages, but also about understanding the complexities involved.
Simulating CPU pressure can help development and engineering teams focus on releasing fixes and updating architectures to make them more fault-tolerant.
Onboarding and Deployment
To onboard and deploy Azure Chaos Studio, you need to ensure the "Microsoft.Chaos" Azure Resource Provider is enabled in your subscription. This can be done by searching for Subscriptions in the Azure portal, selecting the subscription, and registering the "Microsoft.Chaos" provider.
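The portal registration has an equivalent in the Azure Resource Manager REST API, which exposes a provider register operation. The sketch below only builds the request URL, which would be sent as an authenticated POST. The `api-version` value is an assumption that may need updating, and the subscription ID is a placeholder.

```python
def provider_register_url(subscription_id, namespace="Microsoft.Chaos",
                          api_version="2021-04-01"):
    """Build the ARM URL for registering a resource provider (sent as POST)."""
    return (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/providers/{namespace}/register?api-version={api_version}"
    )


# Placeholder subscription ID, for illustration only.
print(provider_register_url("00000000-0000-0000-0000-000000000000"))
```

Scripting this step is useful when several subscriptions need the provider enabled before resources can be onboarded.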
You also need to create a user-assigned managed identity, which is the security identity Chaos Studio uses to interact with Azure target resources. To do this, click Create New Resource, search for User Managed Identity, and complete the necessary parameters.
Chaos Studio requires Application Insights and a Log Analytics workspace to store the service's metadata. To deploy them, select Create New Resource and search for Application Insights, then specify the necessary parameters such as Azure Subscription, Resource Group, and Name.
Once you have these prerequisites in place, you can onboard resources by selecting Onboard Resources in the Azure Chaos Studio blade. This brings you to the Targets section, where you can filter for specific subscriptions or Resource Groups and select the Azure Resource(s) you want to use as a target.
Here's a step-by-step guide to onboarding an Azure VM to Chaos Studio:
- Select the VM, which unlocks the "Enable Targets" menu option
- Select Enable agent-based targets (VM, VMSS) from the menu
- Provide the necessary parameters for the deployment, including Subscription, Azure Managed Identity, and Application Insights account
Note that services other than Azure VMs, such as App Service and Azure Kubernetes Service (AKS), rely on the service-direct scenario, which has no agent dependency.
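Once a resource is onboarded, Chaos Studio represents the target as a child resource of the resource it wraps. The sketch below shows how such a target ID is composed for the agent-based and service-direct cases. The target type names `Microsoft-Agent` and `Microsoft-VirtualMachine` are my understanding of the naming convention and should be confirmed in the documentation; the VM resource ID is hypothetical.

```python
def chaos_target_id(resource_id, target_type):
    """Compose the ID of the Chaos Studio target that wraps a resource."""
    return f"{resource_id.rstrip('/')}/providers/Microsoft.Chaos/targets/{target_type}"


# Hypothetical VM resource ID, for illustration only.
vm_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/demo-rg/providers/Microsoft.Compute/virtualMachines/demo-vm"
)

agent_target = chaos_target_id(vm_id, "Microsoft-Agent")            # agent-based
direct_target = chaos_target_id(vm_id, "Microsoft-VirtualMachine")  # service-direct
print(agent_target)
print(direct_target)
```

The same VM can carry both target types, since agent-based faults act inside the guest operating system while service-direct faults act on the Azure resource itself.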