To start, it's essential to understand that Azure PTU provides dedicated, scalable model processing capacity for handling large volumes of requests. This means you'll need to consider factors like expected traffic, token volumes, and performance requirements when sizing your PTU deployment.
The good news is that Azure provides a range of tools and resources to help you estimate and size your PTU correctly. For example, the Azure Pricing Calculator can help you estimate costs based on your expected usage, while the Azure Architecture Center offers guidance on designing and sizing Azure solutions.
Ultimately, the key to successful PTU estimation and sizing is to understand your specific workload in detail and to validate your estimates with the right tools and resources.
Sizing and Estimation
Sizing and Estimation is a crucial step in optimizing performance and cost when working with Azure PTU. To determine the right amount of provisioned throughput units (PTUs) required for your workload, review the system level throughput estimation recommendations in Azure's performance and latency documentation.
There are two main approaches to estimating system level throughput: using input and output TPM, or using request level data. The built-in capacity planner in the deployment details section of the deployment dialog screen can help streamline the sizing and allocation of quota to a PTU deployment for a given workload.
To estimate provisioned capacity using request level data, open the capacity planner in the Azure AI Foundry and enter the following parameters: Model, Version, Peak calls per min, Tokens in prompt call, and Tokens in model response. The capacity calculator then provides an estimate of the number of PTUs required for the provided workload inputs.
Here's a summary of the required parameters for the capacity calculator:
- Model: the Azure OpenAI model you plan to use
- Version: the version of the model
- Peak calls per min: the peak number of calls expected per minute
- Tokens in prompt call: the number of tokens in each prompt
- Tokens in model response: the number of tokens expected in each model response
Keep in mind that the capacity calculators provide an estimate based on simple input criteria; the most accurate way to determine your capacity is to benchmark a deployment with a representative workload for your use case.
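The arithmetic behind the calculator's request-level inputs can be sketched in a few lines. The per-PTU throughput rate below is a placeholder assumption (real rates vary by model and version), so treat this as an illustration of the math, not a sizing tool:

```python
import math

def estimate_tpm(peak_calls_per_min: int,
                 prompt_tokens_per_call: int,
                 response_tokens_per_call: int) -> int:
    """Total tokens per minute implied by the workload inputs."""
    return peak_calls_per_min * (prompt_tokens_per_call + response_tokens_per_call)

def estimate_ptus(tpm: int, tokens_per_min_per_ptu: int = 2500) -> int:
    """PTUs needed under a hypothetical per-PTU throughput rate."""
    return math.ceil(tpm / tokens_per_min_per_ptu)

tpm = estimate_tpm(peak_calls_per_min=60,
                   prompt_tokens_per_call=1000,
                   response_tokens_per_call=200)
print(tpm)                 # 72000 tokens per minute at peak
print(estimate_ptus(tpm))  # 29 PTUs under the assumed rate
```

The official calculator does this with model-specific rates, which is why its output should always be preferred over hand math.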
Typical Scenarios
You should understand your expected Tokens Per Minute (TPM) usage in detail before migrating workloads to PTU. This will help you size your deployment correctly and avoid potential issues down the line.
Variable token usage is common in function calling and agent use cases, which can make TPM hard to estimate without proper planning.
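To see why that variability matters, here's a small sketch (with made-up request data) showing how peak TPM can far exceed average TPM in an agent-style workload:

```python
from collections import defaultdict

# (minute, total tokens for the request) -- illustrative data only;
# a tool-heavy agent turn can consume several times a normal request
requests = [
    (0, 1200), (0, 300), (0, 4500),
    (1, 800),  (1, 900),
    (2, 700),  (2, 650), (2, 600),
]

# Bin the requests into minutes and sum their token counts
tokens_per_minute = defaultdict(int)
for minute, tokens in requests:
    tokens_per_minute[minute] += tokens

avg_tpm = sum(tokens_per_minute.values()) / len(tokens_per_minute)
peak_tpm = max(tokens_per_minute.values())
print(avg_tpm, peak_tpm)  # size PTUs for the peak, not the average
```

With data like this, sizing from the average would undershoot the busiest minute by roughly half.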
Sizing and Estimation for Deployments
Determining the right amount of provisioned throughput, or PTUs, for your workload is an essential step in optimizing performance and cost.
You can use Azure OpenAI capacity calculators to estimate the number of PTUs required to support a given workload. This is especially useful if you're not familiar with the different approaches available to estimate system level throughput.
To get a quick estimate for your workload using input and output TPM, use the built-in capacity planner in the deployment details section of the deployment dialog screen.
The built-in capacity planner is part of the deployment workflow to help streamline the sizing and allocation of quota to a PTU deployment for a given workload.
To estimate provisioned capacity using request level data, open the capacity planner in the Azure AI Foundry. The capacity calculator is under Shared resources > Model Quota > Azure OpenAI Provisioned.
Here are the parameters you'll need to enter based on your workload: Model, Version, Peak calls per min, Tokens in prompt call, and Tokens in model response.
After filling out the required details, select the Calculate button in the output column. The values shown there are the estimated number of PTUs required for the provided workload inputs.
The capacity calculators provide an estimate based on simple input criteria. The most accurate way to determine your capacity is to benchmark a deployment with a representative workload for your use case.
Estimate Throughput and Cost
To estimate throughput and cost, consider using the built-in capacity planner in the deployment details section of the deployment dialog screen. The planner streamlines the sizing and allocation of quota to a PTU deployment for a given workload.
For a quick estimate, fill in the input and output TPM data in the built-in capacity planner. After selecting the Calculate button, you'll see your PTU allocation recommendation.
To estimate provisioned capacity using request level data, open the capacity planner in the Azure AI Foundry. The capacity calculator is under Shared resources > Model Quota > Azure OpenAI Provisioned.
The capacity calculators provide an estimate based on simple input criteria, but the most accurate way to determine your capacity is to benchmark a deployment with a representative workload for your use case.
To estimate provisioned throughput units and cost, you'll need to enter the same parameters as above: Model, Version, Peak calls per min, Tokens in prompt call, and Tokens in model response.
After filling in the required details, select the Calculate button in the output column.
Throughput Purchase Model
The throughput purchase model for Azure PTUs is based on two main options: Hourly and Reservation. With the Hourly model, you pay on-demand for the number of deployed PTUs, making it ideal for short-term deployment needs like validating new models or acquiring capacity for a hackathon.
Azure OpenAI Provisioned and Global Provisioned deployments are purchased on-demand at an hourly rate based on the number of deployed PTUs, with substantial term discounts available via the purchase of Azure Reservations.
For customers with consistent long-term usage, the Reservation model offers better value thanks to considerable discounts. Customers who onboarded prior to the August self-service update can continue to use the Commitment purchase model alongside the Hourly/reservation purchase model.
However, the Commitment model is not available for new customers. If the deployment size is changed, the costs of the deployment will adjust to match the new number of PTUs.
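The trade-off between the two models comes down to simple arithmetic. A rough sketch, using placeholder prices rather than Azure list prices (check the Azure Pricing Calculator for real figures):

```python
def monthly_cost_hourly(ptus: int, price_per_ptu_hour: float,
                        hours: int = 730) -> float:
    """Pay-as-you-go cost for a month of continuous deployment."""
    return ptus * price_per_ptu_hour * hours

def monthly_cost_reserved(ptus: int, monthly_price_per_ptu: float) -> float:
    """Cost when a reservation covers all deployed PTUs."""
    return ptus * monthly_price_per_ptu

# Assumed illustrative rates -- NOT Azure list prices
hourly = monthly_cost_hourly(100, price_per_ptu_hour=1.0)
reserved = monthly_cost_reserved(100, monthly_price_per_ptu=260.0)
print(hourly, reserved)  # steady 24/7 usage strongly favors the reservation
```

For short bursts (a hackathon, a model validation run), the hourly model wins because you only pay for the hours you actually deploy.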
Here's a summary of the throughput purchase model options:
- Hourly: pay on-demand for the number of deployed PTUs; ideal for short-term needs such as validating new models or acquiring capacity for a hackathon.
- Reservation: a term discount applied on top of hourly usage; the better value for consistent, long-term usage.
Optimizing Token Throughput
You should consider switching to provisioned deployments when you have well-defined, predictable throughput and latency requirements, typically when an application is ready for production or has already been deployed in production.
To estimate provisioned throughput units and cost, you can use the built-in capacity planner in the deployment details section of the deployment dialog screen.
The built-in capacity planner is part of the deployment workflow to help streamline the sizing and allocation of quota to a PTU deployment for a given workload.
To estimate provisioned capacity using request level data, open the capacity planner in the Azure AI Foundry under Shared resources > Model Quota > Azure OpenAI Provisioned.
Here are the parameters you need to enter based on your workload: Model, Version, Peak calls per min, Tokens in prompt call, and Tokens in model response.
After filling out the required details, select the Calculate button to view your PTU allocation recommendation.
The capacity calculators provide an estimate based on simple input criteria, but the most accurate way to determine your capacity is to benchmark a deployment with a representative workload for your use case.
Azure Reservations and Monitoring
Azure Reservations can be canceled after purchase, but cancellation credits are limited, so it's essential to consider your needs carefully before committing to a term.
To prevent over-purchasing a reservation, create deployments first and then purchase the Azure Reservation to cover the PTUs you have deployed.
You can purchase Azure Reservations via the Azure portal, not the Azure AI Foundry portal, and they are purchased regionally, allowing for flexible scoping to cover usage from a group of deployments.
To avoid being charged at the hourly rate for excess PTUs, ensure that the size of your provisioned deployments does not exceed the amount of the reservation.
Here are the key points on reservation scoping:
- New reservations can be purchased to cover the same scope as existing reservations.
- The scope of existing reservations can also be updated at any time without penalty.
- Reservations can be flexibly scoped to cover usage from a group of deployments.
Azure Reservations for OpenAI
Azure Reservations for OpenAI can provide significant discounts on your hourly usage price. To take advantage of these discounts, you'll need to purchase an Azure Reservation for Azure OpenAI provisioned deployments.
Azure Reservations are purchased via the Azure portal, not the Azure AI Foundry portal. This means you'll need to access the Azure portal to make your purchase.
Reservations can be flexibly scoped to cover usage from a group of deployments, and you can update the scope of existing reservations at any time without penalty. However, be aware that reservations can be canceled after purchase, but credits are limited.
If the size of provisioned deployments within the scope of a reservation exceeds the amount of the reservation, the excess will be charged at the hourly rate. For example, if deployments amounting to 250 PTUs exist within the scope of a 200 PTU reservation, 50 PTUs will be charged on an hourly basis until the deployment sizes are reduced to 200 PTUs, or a new reservation is created to cover the remaining 50.
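The 250 PTU / 200 PTU example above can be expressed as a small calculation. The hourly price here is an assumed placeholder, not an Azure list price:

```python
def excess_ptus(deployed_ptus: int, reserved_ptus: int) -> int:
    """PTUs billed hourly because the reservation doesn't cover them."""
    return max(0, deployed_ptus - reserved_ptus)

def hourly_overage_cost(deployed: int, reserved: int,
                        price_per_ptu_hour: float, hours: float) -> float:
    """Cost of the uncovered PTUs over a given period at the hourly rate."""
    return excess_ptus(deployed, reserved) * price_per_ptu_hour * hours

print(excess_ptus(250, 200))                     # 50 PTUs billed hourly
print(hourly_overage_cost(250, 200, 1.0, 24.0))  # one day of overage at the assumed rate
```

The overage accrues every hour until you shrink the deployments to 200 PTUs or buy a reservation for the remaining 50, which is why keeping reservation size aligned with deployed PTUs matters.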
To ensure you get the most out of your Azure Reservation, it's recommended that you create deployments prior to purchasing a reservation to prevent over-purchasing. This will help you take full advantage of the reservation discount and avoid committing to a term you can't use.
The PTU amounts in reservation purchases are independent of PTUs allocated in quota or used in deployments. This means you can purchase a reservation for more PTUs than you have in quota, or can deploy for the desired region, model, or version.
Here are some key things to keep in mind when purchasing Azure Reservations for OpenAI:
- Purchase reservations after creating deployments to prevent over-purchasing.
- Verify authorization to purchase reservations in advance of needing to do so.
- Monitor your reservation utilization to ensure it's receiving the usage you're expecting.
- Keep your reservation sizes in line with your deployed PTUs to avoid excess charges.
OpenAI Usage Monitoring
OpenAI usage monitoring is a crucial practice for controlling costs and catching anomalies. Azure OpenAI is built on Azure Monitor, so the available platform metrics are a good place to start.
The Azure OpenAI Metrics Dashboard is a great place to begin, providing important information about HTTP Requests, Tokens-Based Usage, PTU Utilization, and Fine-tuning. This dashboard helps you understand how your users are sending AI requests and identify peaks of usage during different times.
You can also analyze Processed Prompt Tokens, Generated Completion Tokens, and Processed Inference Tokens, which impact your AI costs. These metrics are useful for monitoring your AI usage and making informed decisions.
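As an illustration, those token metrics can be rolled up into an estimated spend. The per-1K-token prices below are placeholders, not Azure list prices; the metric totals would come from Azure Monitor in practice:

```python
def token_cost(prompt_tokens: int, completion_tokens: int,
               prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Estimated cost from prompt and completion token totals."""
    return (prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k)

# Hypothetical metric totals for a billing period, at assumed prices
cost = token_cost(prompt_tokens=2_000_000, completion_tokens=500_000,
                  prompt_price_per_1k=0.01, completion_price_per_1k=0.03)
print(round(cost, 2))
```

Tracking this kind of roll-up per deployment makes usage spikes visible before they show up on the invoice.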
For more advanced monitoring, you can export all metrics and log data to your Log Analytics workspace using diagnostic settings in Azure Monitor. This allows you to use KQL to query your logs and gain deeper insights into your Azure OpenAI resource.
The table to query is AzureDiagnostics, and you can use a KQL query to inspect the Request and Response log from Azure OpenAI.
Frequently Asked Questions
What is a PTU in Azure?
A PTU (Provisioned Throughput Unit) is a unit of measurement for model processing capacity in Azure, used to size deployments for required throughput. It enables you to achieve the desired processing speed for prompts and completions.
Sources
- https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/provisioned-throughput-onboarding
- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/unleashing-ptu-token-throughput-with-kv-cache-friendly-prompt-on/ba-p/4170161
- https://www.readynez.com/en/blog/microsoft-azure-openai-vs-chatgpt-what-s-the-difference/
- https://journeyofthegeek.com/2024/05/17/azure-openai-service-the-value-of-response-headers-and-log-correlation/
- https://demiliani.com/2023/12/19/monitoring-your-azure-openai-usage/