The LlamaIndex Azure OpenAI integration and migration process is a significant step for businesses looking to leverage the power of large language models, allowing models and data to move between LlamaIndex and Azure OpenAI with little friction.
You can migrate your models and data from LlamaIndex to Azure OpenAI using the Azure OpenAI migration tool, which supports a wide range of model formats and helps ensure a smooth transition with minimal downtime.
Migration typically takes a few hours to complete, depending on the size of the models and data being transferred. Your services may be unavailable during this window, but the disruption is usually temporary.
Azure OpenAI offers a scalable, secure platform for hosting large language models, with features such as automatic model scaling and data encryption, providing a reliable and efficient way to deploy and manage your models in the cloud.
Prerequisites
To get started with LlamaIndex on Azure OpenAI, you'll need a few things. First, an Azure subscription, which gives you access to the Azure AI model inference API used throughout this tutorial.
You'll also need an Azure AI project, which you can create by following the instructions at Create a project in Azure AI Foundry portal.
For this example, we're using a specific model called Mistral-Large, but you can choose any model that supports the Azure AI model inference API. If you want to use embeddings capabilities in LlamaIndex, you'll need an embedding model like cohere-embed-v3-multilingual.
To install the necessary packages, you'll need Python 3.8 or later with pip. You can install LlamaIndex itself with pip install llama-index.
Here are the specific packages you'll need to install:
- llama-index-llms-azure-inference (version 0.2.4 or later)
- llama-index-embeddings-azure-inference (version 0.2.4 or later)
Make sure to install the correct versions of these packages to avoid any issues.
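For example, the integrations can be installed with pip, with version pins matching the minimums above:

```
pip install "llama-index-llms-azure-inference>=0.2.4"
pip install "llama-index-embeddings-azure-inference>=0.2.4"
```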
Azure AI Inference Service
The Azure AI model inference service requires at least version 0.2.4 of the LlamaIndex integration.
If you're using the Azure AI model inference service, which can serve several models behind a single endpoint, you need to pass the model_name parameter so requests are routed to the right model.
Using a wrong api_version, or one the model doesn't support, results in a ResourceNotFound exception, so check which API version your deployment is using before you connect.
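As a rough sketch, configuring both an LLM and an embedding model with the LlamaIndex Azure AI inference integration might look like the following. The endpoint, key, and api_version values are placeholders, and the model names are only illustrative:

```python
from llama_index.llms.azure_inference import AzureAICompletionsModel
from llama_index.embeddings.azure_inference import AzureAIEmbeddingsModel

llm = AzureAICompletionsModel(
    endpoint="https://<your-resource>.services.ai.azure.com/models",  # placeholder
    credential="<your-api-key>",  # placeholder
    model_name="mistral-large-2407",  # required: the service hosts multiple models
    api_version="2024-05-01-preview",  # must be a version your deployment supports
)

embed_model = AzureAIEmbeddingsModel(
    endpoint="https://<your-resource>.services.ai.azure.com/models",  # placeholder
    credential="<your-api-key>",  # placeholder
    model_name="cohere-embed-v3-multilingual",  # embedding model from the catalog
)
```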
Using LLMs
You can use LLMs directly, or configure the models your code uses globally in LlamaIndex. To use a model directly, call the chat method for chat-instruction models; it also lets you stream the outputs.
The complete method is still available for models of the chat-completions type; your input text is converted to a message with role="user", which is a convenient way to work with chat completions.
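As a sketch of the two calling styles, assuming an llm configured as in the previous example:

```python
from llama_index.core.llms import ChatMessage

# Chat-style call with explicit messages.
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Tell me a joke."),
]
response = llm.chat(messages)
print(response)

# Streaming variant: tokens arrive incrementally as deltas.
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="")

# complete() wraps plain text into a role="user" message for chat models.
completion = llm.complete("Tell me a joke.")
print(completion.text)
```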
Setup and Configuration
To set up LlamaIndex with Azure OpenAI, you'll need to create an Azure account and get an API key.
If you're working in JavaScript with LangChain, you'll also need to install the @langchain/openai integration package, which provides access to Azure OpenAI embedding models.
Before you can use Azure OpenAI, you need to have an instance deployed; you can deploy one from the Azure Portal by following Microsoft's deployment guide.
To access your Azure OpenAI instance, you'll need its name and an API key. You can find the key in the Azure Portal, under the “Keys and Endpoint” section of your instance.
If you're using Node.js, you can configure the service through environment variables, including AZURE_OPENAI_API_EMBEDDINGS_DEPLOYMENT_NAME and AZURE_OPENAI_API_DEPLOYMENT_NAME.
You can also set a LangSmith API key to enable automated tracing of your model calls, which in the example code just means uncommenting a single line.
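The linked example targets Node.js with LangChain; in a Python project using LlamaIndex's Azure OpenAI integration, the equivalent setup looks roughly like this sketch. The AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY variable names and the api_version are assumptions for illustration, and the llama-index-llms-azure-openai and llama-index-embeddings-azure-openai packages are assumed to be installed:

```python
import os

from llama_index.core import Settings
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

# Build the LLM client from environment variables (names mirror the ones above).
llm = AzureOpenAI(
    model="gpt-35-turbo",  # illustrative underlying model name
    deployment_name=os.environ["AZURE_OPENAI_API_DEPLOYMENT_NAME"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",  # illustrative; match your deployment
)

# Embedding model pointed at the embeddings deployment.
embed_model = AzureOpenAIEmbedding(
    deployment_name=os.environ["AZURE_OPENAI_API_EMBEDDINGS_DEPLOYMENT_NAME"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Register both as LlamaIndex-wide defaults.
Settings.llm = llm
Settings.embed_model = embed_model
```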
API and Deployment
You can deploy Meta Llama models to serverless API endpoints with pay-as-you-go billing, providing a way to consume models as an API without hosting them on your subscription.
Each deployment has a rate limit of 200,000 tokens per minute and 1,000 API requests per minute, and you can contact Microsoft Azure Support if these limits aren't sufficient for your scenarios.
To deploy the model to a serverless API endpoint, use the Azure AI Foundry portal, Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates.
You can also deploy the model to a self-hosted managed compute, but you must have enough quota in your subscription, or request temporary quota access if you don't.
Here's a summary of the deployment options:
- Serverless API endpoints: pay-as-you-go billing, rate limits of 200,000 tokens per minute and 1,000 API requests per minute
- Self-hosted managed compute: requires enough quota in your subscription, or temporary quota access
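As a concrete illustration of the SDK route, a serverless deployment via the Azure Machine Learning SDK for Python (the azure-ai-ml package) looks roughly like the sketch below; the subscription, workspace, endpoint, and model IDs are all placeholders:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ServerlessEndpoint
from azure.identity import DefaultAzureCredential

# Connect to the Azure AI project / workspace (placeholder identifiers).
client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<project-name>",
)

# Model ID from the model catalog (placeholder; Meta Llama models also
# require an Azure Marketplace subscription before first deployment).
model_id = "azureml://registries/azureml-meta/models/Meta-Llama-3-8B-Instruct"

endpoint = ServerlessEndpoint(name="my-llama-serverless", model_id=model_id)
created = client.serverless_endpoints.begin_create_or_update(endpoint).result()
print(created.scoring_uri)  # base URL for API calls against the endpoint
```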
API Reference
The API reference is a crucial part of any API and deployment strategy, and understanding how it's organized helps you navigate and use the API more effectively.
It opens with an overview that gives a high-level summary of the API's features and functionality, a good starting point for anyone learning the API.
You'll also find information on setting up the API, including instantiation and indexing and retrieval; these sections are particularly useful for developers integrating the API into their applications.
Direct usage of the API is covered as well, including how to use custom headers and how to migrate from the Azure OpenAI SDK, which is helpful for developers who want hands-on experience.
Here's a breakdown of the different sections of the API reference:
- Overview: Provides a high-level summary of the API's features and functionality.
- Setup: Covers instantiation, indexing and retrieval, and other setup-related topics.
- Direct Usage: Includes information on using custom headers and migrating from the Azure OpenAI SDK.
- Using Azure Managed Identity: Explains how to use Azure Managed Identity with the API.
- Using a different domain: Covers how to use the API with a different domain.
- Custom headers: Provides information on how to use custom headers with the API.
- Migration from Azure OpenAI SDK: Helps developers migrate from the Azure OpenAI SDK to the current API.
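As an example of the managed-identity path listed above, LlamaIndex's Azure AI inference integration accepts a Microsoft Entra ID credential in place of an API key. A minimal sketch, assuming the azure-identity package is installed and using a placeholder endpoint and an illustrative model name:

```python
from azure.identity import DefaultAzureCredential
from llama_index.llms.azure_inference import AzureAICompletionsModel

# Authenticate with the workload's managed identity instead of an API key.
llm = AzureAICompletionsModel(
    endpoint="https://<your-resource>.services.ai.azure.com/models",  # placeholder
    credential=DefaultAzureCredential(),
    model_name="mistral-large-2407",  # illustrative model name
)
```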
Model Deployment
You can deploy Meta Llama models to serverless APIs with pay-as-you-go billing, which provides enterprise-grade security and compliance.
This type of deployment doesn't require quota from your subscription, making it a convenient option.
As in the previous section, serverless deployments can be created through the Azure AI Foundry portal, the Azure Machine Learning SDK for Python, the Azure CLI, or ARM templates.
You can also deploy to a self-hosted managed compute, but you'll need enough quota in your subscription.
If you don't have enough, you can request temporary quota access, which is removed after 168 hours.
Meta Llama models can be customized and controlled more directly when deployed to a self-hosted managed compute.
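Once a serverless endpoint is up, one way to consume it from Python is the azure-ai-inference client; a minimal sketch, with placeholder endpoint and key values:

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholders: copy the real values from your deployment's details page.
client = ChatCompletionsClient(
    endpoint="https://<your-endpoint>.<region>.models.ai.azure.com",
    credential=AzureKeyCredential("<your-api-key>"),
)

response = client.complete(messages=[UserMessage(content="What is Meta Llama?")])
print(response.choices[0].message.content)
```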
Cost and Quotas
When deploying Meta Llama models as serverless API endpoints, you're limited to 200,000 tokens per minute and 1,000 API requests per minute per deployment.
Each deployment has its own quota, and if you need more, you'll need to contact Microsoft Azure Support.
Serverless API deployments are billed at Azure Marketplace prices, which you can review when deploying the model to understand the costs associated with your project.
Each time you subscribe to a given offer from the Azure Marketplace, a new resource is created to track the costs of that subscription.
You can then monitor those costs in the Azure portal, where separate meters for inference let you track each scenario independently.
Deploying Meta Llama models to managed compute is billed based on core hours of the associated compute instance.
The cost of the compute instance is determined by the size of the instance, the number of instances running, and the run duration.
It's a good practice to start with a low number of instances and scale up as needed.
You can monitor the cost of the compute instance in the Azure portal.
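Since managed-compute billing is driven by instance size, instance count, and run duration, a back-of-the-envelope estimate is simple arithmetic; the hourly rate below is a made-up placeholder, not a real Azure price:

```python
# Hypothetical cost estimate for a managed-compute deployment.
hourly_rate_per_instance = 3.40  # placeholder USD/hour; check actual Azure pricing
num_instances = 2                # start low and scale up as needed
hours_running = 24 * 7           # one week of continuous operation

estimated_cost = hourly_rate_per_instance * num_instances * hours_running
print(f"Estimated weekly cost: ${estimated_cost:,.2f}")  # -> $1,142.40
```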
Sources
- https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/llama-index
- https://python.langchain.com/docs/integrations/vectorstores/azuresearch/
- https://js.langchain.com/docs/integrations/text_embedding/azure_openai
- https://docs.litellm.ai/docs/proxy/user_keys
- https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama