
Managing quotas for machine learning resources in Azure is a bit of a challenge, but don't worry, I've got you covered.
Azure provides a default quota for machine learning resources, including the number of compute instances and storage accounts.
To avoid hitting these limits, you can request a quota increase through the Azure portal or by contacting Azure support.
This can be a good option if you need more resources for a specific project or to scale up your existing machine learning workload.
Prerequisites
To view your available Azure quota, you'll need to set up the right permissions. For this, we recommend using the Cognitive Services Usages Reader role. This role provides the minimal access necessary to view quota usage across an Azure subscription.
You can find this role in the Azure portal under Subscriptions > Access control (IAM) > Add role assignment > search for Cognitive Services Usages Reader. This role must be applied at the subscription level; it does not exist at the resource level.
If you don't want to use the Cognitive Services Usages Reader role, you can use the subscription Reader role instead. However, keep in mind that this role grants read access beyond the scope of what's needed for viewing quota and model deployment.
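If you'd rather script the role assignment than click through the portal, something along the lines of this sketch should work with a recent version of the azure-mgmt-authorization and azure-identity packages. The subscription ID and principal object ID are placeholders you'd supply yourself, and you need rights to create role assignments (for example, Owner or User Access Administrator) for this to succeed.

```python
# Sketch: assign "Cognitive Services Usages Reader" at subscription scope.
# Assumes azure-identity and a recent azure-mgmt-authorization; the
# subscription ID and principal object ID below are placeholders.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"           # placeholder
principal_object_id = "<user-or-sp-object-id>"  # placeholder

scope = f"/subscriptions/{subscription_id}"
client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Look up the built-in role definition by name.
role = next(client.role_definitions.list(
    scope, filter="roleName eq 'Cognitive Services Usages Reader'"))

# Create the assignment at subscription scope (assignment names are GUIDs).
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),
    RoleAssignmentCreateParameters(
        role_definition_id=role.id,
        principal_id=principal_object_id,
    ),
)
```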
Introduction

Quota in Azure OpenAI is a feature that allows you to assign rate limits to your deployments, up to a global limit called your "quota." This quota is assigned to your subscription on a per-region, per-model basis in units of Tokens-per-Minute (TPM).
You'll receive default quota for most available models when you onboard a subscription to Azure OpenAI. This quota can be used to create deployments with varying TPM assignments.
The available quota for a model will be reduced by the TPM assigned to each deployment. You can continue to create deployments and assign them TPM until you reach your quota limit.
Here's an example of how quota works: with a quota of 240,000 TPM for GPT-35-Turbo in East US, you can create a single deployment of 240K TPM, 2 deployments of 120K TPM each, or any number of deployments in one or multiple Azure OpenAI resources as long as their TPM adds up to less than 240K total in that region.
A Requests-Per-Minute (RPM) rate limit will also be enforced, set proportionally to the TPM assignment using the following ratio: 6 RPM per 1000 TPM.
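To make the arithmetic concrete, here's a small illustrative check using the 240,000 TPM example and the 6 RPM per 1,000 TPM ratio described above. The deployment names and their TPM splits are made up for the example.

```python
# Illustrative check of TPM allocation against a regional quota,
# using the 6 RPM per 1,000 TPM ratio described above.
REGIONAL_QUOTA_TPM = 240_000  # example: GPT-35-Turbo in East US

# Hypothetical deployments and their TPM assignments.
deployments = {"chat-prod": 120_000, "chat-staging": 60_000, "batch-eval": 40_000}

allocated = sum(deployments.values())
remaining = REGIONAL_QUOTA_TPM - allocated
print(f"Allocated {allocated} TPM, {remaining} TPM still available in this region")

for name, tpm in deployments.items():
    rpm = tpm // 1_000 * 6  # RPM is set proportionally: 6 RPM per 1,000 TPM
    print(f"{name}: {tpm} TPM -> {rpm} RPM")
```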
Azure Quota Management

Azure OpenAI computes an estimated max-processed-token count for each request, which includes the prompt text and count, the max_tokens parameter setting, and the best_of parameter setting. This estimate feeds the rate-limit token count, which is not the same as the token calculation used for billing or for determining a request's input token limit.
To manage quotas, you can request increases to the endpoint limit, VM quota, or compute limit. To raise the endpoint limit, you need to provide detailed reasons for the increase, including the new value for each limit and the location(s) where you need it. For compute limit increases, mention "Increase total compute limits" in the summary and provide the subscription ID, region, and new limit.
Model Specific Settings
In Azure Quota Management, model specific settings allow you to control the maximum amount of TPM that can be allocated to a type of model deployment in a given region.
Different model deployments, also known as model classes, can have unique max TPM values that can be controlled.
Certain model classes have their own specific max TPM value, while the remaining model classes share a common max TPM value.
Model input token limits are defined in the models table and are separate from quota: TPM allocation has no effect on a model's input token limit, so you can adjust TPM settings without worrying about affecting those limits.
Understanding Rate Limits
Rate limits are a crucial aspect of Azure quota management. They help prevent abuse of resources and ensure fair usage for all customers. Rate limits are applied to each subscription and can be increased by requesting a quota increase.
Azure OpenAI computes an estimated max processed-token count for each request, which includes the prompt text and count, max_tokens parameter setting, and best_of parameter setting. This estimated count is added to a running token count of all requests that is reset each minute.
The token count used in the rate-limit calculation is an estimate based in part on the character count of the API request, which means a rate limit can be triggered earlier than you might expect compared with an exact token count for each request.
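As a rough illustration only, here's what a character-based estimate combined with max_tokens and best_of might look like. The 4-characters-per-token heuristic and the way the terms are combined are assumptions for the sake of the example, not the exact formula Azure OpenAI uses.

```python
# Rough illustration of a rate-limit token estimate.
# ASSUMPTION: ~4 characters per token and a simple combination of
# max_tokens and best_of; this is NOT Azure OpenAI's exact formula.
def estimated_rate_limit_tokens(prompt: str, max_tokens: int, best_of: int = 1) -> int:
    prompt_estimate = len(prompt) // 4          # character-based prompt estimate
    completion_estimate = max_tokens * best_of  # worst-case completion budget
    return prompt_estimate + completion_estimate

print(estimated_rate_limit_tokens("Summarize the quarterly report.", max_tokens=500, best_of=2))
```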
RPM rate limits are based on the number of requests received over time, and the rate limit expects that requests be evenly distributed over a one-minute period. If this average flow isn't maintained, then requests may receive a 429 response even though the limit isn't met when measured over the course of a minute.
To minimize issues related to rate limits, it's a good idea to use the following techniques:
- Set max_tokens and best_of to the minimum values that serve the needs of your scenario.
- Use quota management to increase TPM on deployments with high traffic, and to reduce TPM on deployments with limited needs.
- Implement retry logic in your application (see the sketch after this list).
- Avoid sharp changes in the workload. Increase the workload gradually.
- Test different load increase patterns.
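Here's a minimal retry sketch with exponential backoff, assuming you call Azure OpenAI through the official openai Python package (v1+), which raises RateLimitError on 429 responses. The endpoint, key, API version, deployment name, and backoff values are placeholders you'd replace with your own.

```python
# Minimal retry sketch with exponential backoff for 429 (rate limit) responses.
# Assumes the openai Python package (v1+) and an Azure OpenAI deployment;
# endpoint, key, API version, and deployment name are placeholders.
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-key>",                                        # placeholder
    api_version="2024-02-01",                                    # placeholder
)

def chat_with_retries(messages, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="<deployment-name>",  # placeholder deployment name
                messages=messages,
                max_tokens=256,  # keep max_tokens as low as your scenario allows
            )
        except RateLimitError:
            # Back off exponentially before retrying the request.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Rate limit retries exhausted")

response = chat_with_retries([{"role": "user", "content": "Hello"}])
print(response.choices[0].message.content)
```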
Resource Group
When managing Azure Resource Groups, it's essential to understand their limits to avoid hitting hard limits and experiencing issues.
One of the lesser-known facts is that Resource Groups have a default limit of 800 resources per resource type. This can be a significant constraint, especially when working with multiple resource types.
While these limits are rarely hit, it's still good to know they exist. Understanding these limits can help you plan and optimize your resource group structure.
Here's a breakdown of the key limits for Resource Groups:
- Resources per resource group: 800 per resource type by default.
- Resources per deployment: 800.
- Management locks: 20 per unique scope, so be mindful of this when implementing locks for your Resource Groups.
- Tags: up to 50 tag name/value pairs per resource or resource group, with a maximum length of 512 characters for tag names and 256 characters for tag values.
Shared Quota
Shared quotas can be a lifesaver for Azure users. They provide a temporary pool of resources that can be accessed by users across various regions for a limited time.
Azure Machine Learning offers a shared quota pool that can be used for running Spark jobs and testing inferencing for certain models. This pool is available for use cases that require a short-term quota increase.
To access the shared quota pool, you'll need an Enterprise Agreement subscription. This pool is not for production endpoints, but rather for creating temporary test endpoints.
You can opt out of shared quota for Spark jobs by filling out a form. This is a great option if you don't need the extra resources.
Remember, shared quota is usage-based, just like billing for dedicated virtual machine families. Be sure to keep track of your usage to avoid any surprises on your bill.
Core Quotas in Batch Mode
Core quotas in Batch mode are a crucial aspect of Azure quota management. They determine the number of cores available for use in your Batch account.
For dedicated nodes, Batch enforces a core quota limit for each VM series, as well as a total core quota limit for the entire Batch account. This means you need to be mindful of both the individual series limits and the overall account limit.
Batch enforces a total core quota for Spot nodes, without any distinction between different VM series. This is a key difference from dedicated nodes, where quotas are applied per series.
If you created a Batch account with pool allocation mode set to user subscription, the Azure Batch core quotas don't apply. Instead, the quotas in your subscription for regional compute cores, per-series compute cores, and other resources are used and enforced.
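If you want to check these quotas programmatically rather than in the portal, a sketch along these lines with the azure-mgmt-batch and azure-identity packages should surface the account-level and per-series values; the subscription, resource group, and account names are placeholders.

```python
# Sketch: inspect a Batch account's core quotas with azure-mgmt-batch.
# Assumes azure-identity and azure-mgmt-batch; the subscription, resource
# group, and account names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.batch import BatchManagementClient

client = BatchManagementClient(DefaultAzureCredential(), "<subscription-id>")
account = client.batch_account.get("<resource-group>", "<batch-account>")

print("Total dedicated core quota:", account.dedicated_core_quota)
print("Spot/low-priority core quota:", account.low_priority_core_quota)

# Per-VM-series dedicated quotas (when per-family quotas are enforced).
for family in account.dedicated_core_quota_per_vm_family or []:
    print(f"{family.name}: {family.core_quota} cores")
```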
Machine Learning Compute
Azure Machine Learning Compute has a default quota limit on both the number of cores and the number of unique compute resources that are allowed per region in a subscription.
The quota on the number of cores is split by each VM Family and cumulative total cores.
The quota on the number of unique compute resources per region is separate from the VM core quota, as it applies only to the managed compute resources of Azure Machine Learning.
Dedicated cores per region have a default limit of 24 to 300, depending on your subscription offer type.
Low-priority cores per region have a default limit of 100 to 3,000, depending on your subscription offer type.
Total compute limit per region has a default limit of 500 per region within a given subscription and can be increased up to a maximum value of 2500 per region.
This limit is shared between training clusters, compute instances, and managed online endpoint deployments.
A compute instance is considered a single-node cluster for quota purposes.
Machine Learning Endpoints
Azure Machine Learning online endpoints and batch endpoints have specific resource limits. These limits are regional, meaning you can use up to these limits per each region you're using.
The number of endpoints per subscription is limited to 100 per region. For example, if you're using the East US and West US regions, you can create 100 endpoints in each region.
Endpoint names must begin with a letter and be 3-32 characters in length. They can only consist of letters and numbers. This applies to all types of endpoints.
Deployment names have similar requirements, including starting with a letter and being 3-32 characters in length. They can only consist of letters and numbers.
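Based on the naming rules stated above (start with a letter, 3-32 characters, letters and numbers only), a quick client-side check might look like the sketch below. It only mirrors the rules as described here; it's a convenience check, not an official validator.

```python
# Quick client-side check of an endpoint or deployment name, mirroring the
# rules described above: starts with a letter, 3-32 characters, letters and
# numbers only.
import re

NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9]{2,31}$")

def is_valid_endpoint_name(name: str) -> bool:
    return bool(NAME_PATTERN.match(name))

print(is_valid_endpoint_name("fraudmodel01"))  # True
print(is_valid_endpoint_name("1-bad-name"))    # False: starts with a digit and contains hyphens
```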
Managed online endpoints have additional limits, including a maximum request time-out of 180 seconds and a total requests per second limit of 500.
Machine Learning Pipelines
Learning Pipelines are a crucial part of Azure Machine Learning, and it's essential to understand their limits to avoid any issues.
You can have up to 30,000 steps in a pipeline, which is a pretty high number, but it's still a limit.
To manage your pipelines efficiently, you should be aware of the maximum number of workspaces allowed per resource group, which is 800.
If you're planning to create multiple pipelines, you'll want to keep an eye on these limits to avoid hitting the ceiling.
Resource Availability
Azure Resource Limits are not arbitrary limits, but rather a protective measure for both Microsoft and customers to avoid gross overspending in the cloud. These limits and quotas help keep cloud spending predictable and prevent unexpected huge monthly bills.
Microsoft Azure Limits and Quotas are implemented to throttle the growth of cloud usage within individual Azure Regions and globally across all data centers and regions. This allows Microsoft to ensure that individual customers don't overload any particular Azure Region or data center unexpectedly.
You can view your quota for various Azure resources like virtual machines, storage, or network using the Azure portal. Simply select the subscription whose quota you're looking for, and then select Usage + quotas to view your current quota limits and usage.
Azure Machine Learning provides a shared quota pool from which users across various regions can access quota to perform testing for a limited amount of time. The specific time duration depends on the use case, and this shared quota is available for running Spark jobs and testing inferencing for certain models.
Default limits vary depending on the type of subscription you use to create a Batch account. Quotas aren't guaranteed values; they can change based on adjustments from the Batch service or a user request to change a quota value.
Migrating Existing Deployments
Azure OpenAI model deployments have been automatically migrated to use quota as part of the transition to the new quota system and TPM based allocation.
This means that no manual action was required from users to migrate their existing deployments, making the process seamless and efficient.
In cases where the existing TPM/RPM allocation exceeded the default values because of previous custom rate-limit increases, equivalent TPM was assigned to the affected deployments.
Resource Deletion
When you try to delete an Azure OpenAI resource from the Azure portal, deletion is blocked if there are still deployments present.
You need to delete the associated deployments first; deleting them ensures that the quota allocations are properly released for new deployments.
However, if you delete a resource using the REST API or another programmatic method, the quota allocation remains unavailable for 48 hours.
To free up quota immediately, you need to trigger a purge for the deleted resource.
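If you want to script that purge, a sketch like the one below with a recent version of the azure-mgmt-cognitiveservices package should do it. The subscription ID, region, resource group, and resource name are placeholders.

```python
# Sketch: purge a soft-deleted Azure OpenAI (Cognitive Services) resource so
# its quota allocation is released immediately. Assumes azure-identity and a
# recent azure-mgmt-cognitiveservices; names and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.cognitiveservices import CognitiveServicesManagementClient

client = CognitiveServicesManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Purge the deleted account in the region where it lived.
poller = client.deleted_accounts.begin_purge(
    location="<region>",                     # e.g. "eastus"
    resource_group_name="<resource-group>",
    account_name="<openai-resource-name>",
)
poller.wait()
print("Purge completed; quota is freed for new deployments.")
```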
Resource Limits
Resource limits and quotas are in place to prevent overspending in the cloud by accident. These limits help keep your cloud spending predictable and prevent unexpected huge monthly bills.
Azure Resource Groups have limits on the number of resources and other things on an individual Resource Group. The soft or default limit for resources per Resource Group is 800, while the hard or max limit varies per resource type.
Resource Limits protect YOU from accidentally going bankrupt by generating huge monthly spend before realizing the full cost of your Azure Resources. This is especially helpful for preventing gross overspending in the cloud.
A quota is a limit, not a capacity guarantee. If you have large-scale capacity needs, contact Azure support. Quotas can vary based on changes from the Batch service or a user request to change a quota value.
For the following resource types, both the default and the maximum quota limits vary:
- Azure Machine Learning assets
- Azure Machine Learning computes
- Azure Machine Learning shared quota
- Azure Machine Learning online endpoints
- Azure Machine Learning pipelines
- Azure Machine Learning integration with Synapse
- Virtual machines
- Azure Container Instances
- Azure Storage
To view your quota for various Azure resources, use the Azure portal. Select All services, then select Subscriptions under the General category, and from the list of subscriptions, select the subscription whose quota you're looking for.
Pool Size
Pool size limits are set by the Batch service, and they can't be changed. Only certain pool configurations have limits that differ from the standard quotas.
For pools with inter-node communication enabled, the limit is 100 compute nodes under Batch service pool allocation mode and 80 compute nodes under Batch subscription pool allocation mode.
For pools created from a managed image resource, the limit is 2,000 dedicated compute nodes, which is relatively high but still important to keep in mind when planning your resource allocation.
Virtual Machines
Virtual resources have their own set of limitations and requirements. Azure Machine Learning reserves 20% of your compute resources for upgrades on some VM SKUs, so you need quota for ceil(1.2 * number of instances requested for the deployment) * number of cores for the VM SKU.
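As a quick sanity check of that formula, here's the arithmetic for a hypothetical deployment; the instance count and SKU core count are made-up values.

```python
# Quota needed when Azure Machine Learning reserves 20% extra for upgrades:
# ceil(1.2 * instances requested) * cores per instance for the VM SKU.
import math

instances_requested = 10   # hypothetical deployment size
cores_per_instance = 4     # hypothetical VM SKU core count

required_quota_cores = math.ceil(1.2 * instances_requested) * cores_per_instance
print(f"Quota needed: {required_quota_cores} cores")  # ceil(12) * 4 = 48
```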
Each subscription has a limit on the number of virtual machines across all services, and virtual machine cores have a regional total limit and a regional limit per size series. This means you can't deploy more virtual machines than your subscription allows.
You should check the Managed online endpoints SKU list to see which VM SKUs are exempt from extra quota reservation. This list is essential for planning your deployments.
To view your usage and request quota increases, see View your usage and quotas in the Azure portal. This gives you a clear picture of your resource usage and helps you plan future deployments.
Virtual machine limits apply to each subscription and are enforced separately for each region and size series, so keep this in mind when planning your deployments.
These limits can't be raised above their documented maximum values, so it's essential to plan carefully to avoid hitting them.
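If you'd like to watch these limits programmatically, a sketch with the azure-mgmt-compute and azure-identity packages can list regional VM usage against quota. The subscription ID and region are placeholders, and the 80% threshold is just an arbitrary example cutoff.

```python
# Sketch: list regional VM core usage against quota with azure-mgmt-compute.
# Assumes azure-identity and azure-mgmt-compute; subscription and region are
# placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

for usage in client.usage.list("<region>"):  # e.g. "eastus"
    # Flag quota buckets that are close to their limit (80% is arbitrary).
    if usage.limit and usage.current_value / usage.limit >= 0.8:
        print(f"{usage.name.localized_value}: {usage.current_value}/{usage.limit}")
```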
Storage
When working with Azure, it's essential to understand the storage limits to avoid any issues. Azure Storage allows 250 storage accounts per region, per subscription, and this limit includes both Standard and Premium storage accounts, so plan your storage accordingly to stay within it.
Frequently Asked Questions
What is the difference between capacity and quota in Azure?
Capacity in Azure refers to the actual amount of resources available, while quota is a credit limit that prevents overspending. Understanding the difference is crucial for managing your Azure resources effectively.
What is DTU quota in Azure?
Azure SQL Database has a DTU quota of 15000 DTUs per logical server, which is the total performance capacity for all databases hosted on that server. This quota determines the maximum combined performance level of all databases on the server.