Azure Text to Speech is a powerful tool that converts written text into natural-sounding speech. It's a game-changer for anyone looking to create engaging audio content.
With Azure Text to Speech, you can choose from over 180 voices across 30 languages and 3 dialects. This means you can tailor your audio content to suit your audience's language and cultural preferences.
Azure Text to Speech is also highly customizable, allowing you to adjust factors like pitch, rate, and volume to suit your needs. This is particularly useful for applications like audiobooks, podcasts, and voice assistants.
By leveraging Azure Text to Speech, you can create immersive and engaging experiences that captivate your audience and leave a lasting impression.
Core Features
Azure text to speech offers prebuilt neural voices that sound highly natural out of the box. You can access them by creating an Azure subscription and a Speech resource, then using the Speech SDK or the Speech Studio portal.
The prebuilt voices are listed in the Voice Gallery, where you can audition them and pick the right voice for your business needs. It's also worth checking the pricing details so you know the costs involved before you commit.
Neural Features
Azure text to speech has neural features that make synthesized speech sound remarkably natural. The prebuilt neural voices are the quickest way to get started: once you have an Azure subscription and a Speech resource, browse the Voice Gallery to find the right voice for your business, then select it through the Speech SDK or the Speech Studio portal.
Custom neural voices are also available, which allow you to create a natural brand voice that's unique to your product or brand. This requires a handful of audio files and associated transcriptions.
You can choose from a range of prebuilt voices for the avatar, including prebuilt neural voices available on Azure AI Speech. The language support for text to speech avatar is the same as the language support for text to speech.
Here's a quick rundown of the neural features covered above:

- Prebuilt neural voices: natural-sounding voices available out of the box through the Speech SDK or Speech Studio
- Custom neural voice: a unique brand voice trained from your own audio files and transcriptions
- Text to speech avatar voices: the same prebuilt neural voices, used to give a digital avatar its speech
Remember, it's normal to try several voices and a few sample phrases for each one until you find a voice you like for your application.
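For example, here's roughly how you'd point the Speech SDK at a specific prebuilt neural voice in C#. The key, region, and voice name below are placeholders; pick whichever voice fits your scenario from the Voice Gallery.

```csharp
using Microsoft.CognitiveServices.Speech;

// Placeholder key and region; use the values from your own Speech resource.
var config = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "westus2");

// Select a prebuilt neural voice by its short name from the Voice Gallery.
// en-US-JennyNeural is just one example; swap in any voice you prefer.
config.SpeechSynthesisVoiceName = "en-US-JennyNeural";
```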
Text to Speech
Text to speech is a powerful feature that can make your applications more engaging and conversational. If you use the text to speech avatar, keep the billing model in mind: batch avatar synthesis is charged per second of video output, while the real-time avatar is charged per second of session time, whether the avatar is speaking or sitting silent. To optimize costs, follow the tips in the sample code, such as playing a local video while the avatar is idle.
You can also suspend your endpoint to save costs and simply redeploy it when you want to use it again. When converting text to speech with SSML, you can control parameters such as the voice, speaking rate, pitch, and volume, as in the sketch below.
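Here's a rough illustration of how those parameters might be expressed with the Speech SDK's SpeakSsmlAsync method. The voice name and prosody values are arbitrary examples, and the helper assumes a SpeechSynthesizer has already been created (see the full example later in this section).

```csharp
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

static class SsmlExample
{
    // Assumes an already-created SpeechSynthesizer.
    public static async Task SpeakWithSsmlAsync(SpeechSynthesizer synthesizer)
    {
        // SSML lets you choose the voice and tune prosody (rate, pitch, volume)
        // per request, without changing the SpeechConfig.
        string ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    <prosody rate='+10%' pitch='-5%' volume='90'>
      Welcome, and thanks for listening.
    </prosody>
  </voice>
</speak>";

        // The result is disposed as soon as it goes out of scope.
        using SpeechSynthesisResult result = await synthesizer.SpeakSsmlAsync(ssml);
    }
}
```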
To get the full list of voices for a specific region or endpoint, you can use the Speech SDK or the text to speech REST API. The SpeechSynthesizer is a memory-intensive class that implements IDisposable, so wrap it in a using statement to handle system resources responsibly.
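If you're working in C#, recent Speech SDK versions expose a GetVoicesAsync method on the synthesizer for enumerating voices. Treat the sketch below as an assumption-laden example: the property names used (ShortName, Locale) reflect how the voice list is commonly surfaced, so verify them against your SDK version.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

static class VoiceListExample
{
    // Assumes an already-created SpeechSynthesizer.
    public static async Task ListVoicesAsync(SpeechSynthesizer synthesizer)
    {
        // Ask the service for the voices available on this endpoint,
        // optionally filtered to a locale such as en-US.
        SynthesisVoicesResult voices = await synthesizer.GetVoicesAsync("en-US");

        foreach (VoiceInfo voice in voices.Voices)
        {
            Console.WriteLine($"{voice.ShortName} ({voice.Locale})");
        }
    }
}
```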
You can call the SpeakTextAsync method on the SpeechSynthesizer to generate the audio data; with the default audio output, it also speaks the text aloud through your system speakers. Dispose of the SpeechSynthesisResult instance once it's no longer needed to avoid holding on to memory.
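Putting those pieces together, a minimal console app might look something like this. The key, region, and voice name are placeholders, and with the default audio output the speech plays through your speakers.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;

class Program
{
    static async Task Main()
    {
        // Placeholder key and region; use your own Speech resource values.
        var config = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "westus2");
        config.SpeechSynthesisVoiceName = "en-US-JennyNeural";

        // The synthesizer wraps native resources, so a using block disposes it for us.
        // With no AudioConfig argument, audio plays through the default speaker.
        using var synthesizer = new SpeechSynthesizer(config);

        // SpeakTextAsync returns the synthesized audio and plays it aloud.
        using var result = await synthesizer.SpeakTextAsync("Azure text to speech is ready.");

        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            Console.WriteLine($"Generated {result.AudioData.Length} bytes of audio.");
        }
    }
}
```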
Avatar Capabilities
Azure text to speech avatars can convert text into a digital video of a photorealistic human speaking with natural-sounding voices powered by Azure AI text to speech.
The avatars come with a collection of prebuilt avatars to choose from. You can select the one that best fits your needs.
The voice of the avatar is generated by Azure AI text to speech. For more information, see Avatar voice and language.
Azure text to speech avatars can synthesize avatar video asynchronously with the batch synthesis API or in real time, so you can choose the method that works best for your project.
To create video content without coding, you can use the content creation tool in Speech Studio. It's a great way to get started with creating video content.
You can also use the live chat avatar tool in Speech Studio to enable real-time avatar conversations. This is perfect for applications where you need to have a conversation with your users.
Here are the capabilities of Azure text to speech avatars:
- Converts text into a digital video of a photorealistic human speaking
- Provides a collection of prebuilt avatars
- The voice of the avatar is generated by Azure AI text to speech
- Synthesizes text to speech avatar video asynchronously or in real-time
- Provides a content creation tool in Speech Studio
- Enables real-time avatar conversations through the live chat avatar tool in Speech Studio
Configuration and Management
To get started with Azure Text to Speech, you'll need to create a SpeechConfig instance, which is the main object that communicates with Azure.
This object requires the key for your Cognitive Services or Speech resource, along with the region it was deployed to.

To keep your subscription key secure, don't check it into source control; instead, store it in a configuration source that stays outside your repository, such as a local configuration file or an environment variable.
Creating a Config
To create a SpeechConfig instance, you'll need the key to your cognitive services or speech resource, as well as the region it was deployed to.
You can find the key on the Keys and Endpoint blade of your resource in the Azure portal. It doesn't matter which of the two keys you use; Azure provides two so you can regenerate one while your application keeps running on the other.
Make sure to store your subscription key securely, since anyone who has it can make requests against your resource on its pay-per-use billing. As noted above, keep it out of source control and in a configuration source you can protect.
Before you can reference these APIs in C# code, you'll need to add a reference to the Microsoft.CognitiveServices.Speech NuGet package, either through the NuGet package manager or the .NET CLI. This gives you access to the Speech SDK classes.
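After adding the package (for example with `dotnet add package Microsoft.CognitiveServices.Speech`), creating the config might look like the sketch below. The environment variable names are just a convention for this example; any secure configuration source works.

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

class ConfigExample
{
    public static SpeechConfig CreateConfig()
    {
        // Read the key and region from environment variables so the key
        // never ends up in source control. These variable names are only
        // an example; use whatever configuration mechanism your app has.
        string key = Environment.GetEnvironmentVariable("SPEECH_KEY")
            ?? throw new InvalidOperationException("Set the SPEECH_KEY environment variable.");
        string region = Environment.GetEnvironmentVariable("SPEECH_REGION") ?? "westus2";

        // SpeechConfig tells the SDK which resource and region to call.
        return SpeechConfig.FromSubscription(key, region);
    }
}
```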
Saving Audio to Disk
Saving audio to disk can be a cost-effective way to reuse generated speech in your application.
You can use Azure Cognitive Services to generate an audio file once and then save it to disk as a .wav file.
This .wav file can be played multiple times without needing to regenerate it from the speech API.
To write the audio file to disk, you can use a simple File.WriteAllBytes call, which creates the file if it doesn't already exist and writes the audio data to it. The file will be named output.wav and will sit in the application's runtime directory.
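Here's a minimal sketch of that flow. Passing a null AudioConfig keeps the SDK from also playing the audio aloud; the key and region are placeholders.

```csharp
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

class SaveToDisk
{
    static async Task Main()
    {
        // Placeholder key and region; use your own Speech resource values.
        var config = SpeechConfig.FromSubscription("YOUR_SPEECH_KEY", "westus2");

        // A null AudioConfig means the audio is only returned in memory,
        // not played through the speakers.
        using var synthesizer = new SpeechSynthesizer(config, null as AudioConfig);

        using var result = await synthesizer.SpeakTextAsync("Hello from Azure text to speech.");

        // result.AudioData holds the synthesized audio as a byte array.
        // File.WriteAllBytes creates (or overwrites) output.wav in the
        // application's runtime directory so it can be replayed later
        // without another call to the speech API.
        File.WriteAllBytes("output.wav", result.AudioData);
    }
}
```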
Monitoring and Metrics
Monitoring Azure text to speech metrics is crucial for managing resource usage and controlling costs. You can find usage information in the Azure portal.
Key metrics for Azure text to speech services include Synthesized Characters, which tracks the number of characters converted into speech. For details on billable characters, see the relevant documentation.
Video Seconds Synthesized measures the total duration of video synthesized. This includes batch avatar synthesis, real-time avatar synthesis, and custom avatar synthesis.
To track your custom avatar model's hosting time, use the Avatar Model Hosting Seconds metric. This will give you the total time in seconds that your custom avatar model is hosted.
If you're hosting a custom neural voice model, track the Voice Model Hosting Hours metric. This will give you the total time in hours that your custom neural voice model is hosted.
Here's a summary of the key metrics for Azure text to speech services:

- Synthesized Characters: the number of characters converted into speech
- Video Seconds Synthesized: the total duration of avatar video synthesized, across batch, real-time, and custom avatar synthesis
- Avatar Model Hosting Seconds: the total time, in seconds, that a custom avatar model is hosted
- Voice Model Hosting Hours: the total time, in hours, that a custom neural voice model is hosted
Available Locations
The text to speech avatar feature is only available in specific service regions, which is something to keep in mind when setting it up.
These regions include Southeast Asia, North Europe, West Europe, Sweden Central, South Central US, East US 2, and West US 2.
You'll want to make sure your service region is one of these to ensure the feature works smoothly.
Sources
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech
- https://accessibleai.dev/post/text-to-speech-cognitive-services/
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar
- https://learn.microsoft.com/en-us/connectors/azuretexttospeech/
- https://dsj23.me/2024/04/16/creating-azure-function-for-azure-speech-services-speech-text-which-inputs-any-audio-format/