Azure Automated ML (AutoML) is a powerful tool that automates the machine learning process, allowing you to train and deploy models without extensive expertise.
AutoML uses a combination of pre-built algorithms and automated hyperparameter tuning to find the best model for your specific problem.
This approach can significantly reduce the time and effort required to develop and deploy machine learning models, making it an attractive option for businesses and organizations with limited resources.
AutoML supports a wide range of machine learning tasks, including classification, regression, and time-series forecasting, and can handle large datasets and complex problems.
Getting Started
To get started with Azure AutoML, you'll need an Azure subscription. It's a free or paid account that you can create in just a few minutes, so set one up first if you don't already have one.
To prepare for the experiment, you'll also need to create an Azure Machine Learning workspace or compute instance. This can be done by following the Quickstart guide in the Azure documentation.
Here are the specific prerequisites you'll need to get started:
- An Azure subscription, which can be created for free
- An Azure Machine Learning workspace or compute instance, which can be prepared by following the Quickstart guide
- The data asset to use for the Automated ML training job, which can be an existing data asset or created from a data source
Prerequisites
To get started with Azure Machine Learning, you'll need to meet a few prerequisites. An Azure subscription is required, so if you don't have one, create a free account before proceeding.
You'll also need an Azure Machine Learning workspace or compute instance. If you're starting from scratch, you can create one through the Quickstart: Get started with Azure Machine Learning guide.
To prepare your data, you'll need to have the data asset ready for the Automated ML training job. This could be an existing data asset or one that you create from a data source, such as a local file, web URL, or datastore.
Here are the specific requirements for the training data:
- The data asset to use for the Automated ML training job must be in tabular form.
- The value you want to predict (the target column) must be present in the data.
In some cases, you may also need to download a specific data file, such as the bankmarketing_train.csv data file, which is available under the Creative Commons (CC0: Public Domain) License.
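If you're working with the Python SDK v2, you can reference such a data asset as an MLTable input for the training job. The following is a minimal sketch; the asset name and version are placeholders for whatever you registered in your workspace.

```python
# Hedged sketch: reference a registered MLTable data asset as the
# training input for an AutoML job (asset name/version are hypothetical).
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

my_training_data_input = Input(
    type=AssetTypes.MLTABLE,
    path="azureml:bankmarketing-train:1",  # placeholder <asset-name>:<version>
)
```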
Responsible AI Dashboard
The Responsible AI Dashboard is a powerful tool that helps you implement responsible AI practices effectively and efficiently. It's only supported for tabular data and classification and regression models.
You can access the Responsible AI Dashboard in the Azure Machine Learning studio, where you'll find a single interface to view various insights for your model. This dashboard brings together several mature Responsible AI tools, including model performance and fairness assessment, data exploration, machine learning interpretability, and error analysis.
The dashboard is essential for inspecting the model's fairness, viewing its explanations, and inspecting its errors and potential blind spots. This is because model evaluation metrics and charts only give you a general idea of a model's quality.
The Responsible AI Dashboard provides a comprehensive view of your model's performance, including metrics and charts. However, it's not just about numbers – it's also about understanding how your model works and where it might be flawed.
Here are the key areas of focus for the Responsible AI Dashboard:
- Model performance and fairness assessment
- Data exploration
- Machine learning interpretability
- Error analysis
By using the Responsible AI Dashboard, you can gain a deeper understanding of your model's strengths and weaknesses, and make data-driven decisions to improve its performance and fairness.
Automated ML Job
After creating an Automated Machine Learning job, you can view the run history and model evaluation metrics and charts in the Azure Machine Learning studio. You can do this by signing into the studio, navigating to your workspace, and selecting Jobs from the left menu.
To view the run history, select an automated ML job from the table at the bottom of the page. In the Models tab, select the Algorithm name for the model you want to evaluate. You can then use the checkboxes on the left to view metrics and charts.
The Limits section provides configuration options for the job, allowing you to specify the maximum number of trials, concurrent trials, nodes, and experiment timeout. You can also enable early termination of the job when the score isn't improving in the short term.
Here are the settings you can configure in the Limits section:
- Max trials
- Max concurrent trials
- Max nodes
- Metric score threshold
- Experiment timeout
- Enable early termination
After your experiment completes, you can test the models Automated ML generates for you. You can do this by selecting an existing Automated ML experiment job, browsing to the Models tab, and selecting the completed model you want to test.
Job Configuration
To configure your Azure Automated ML job, you need to set limits for the job. These limits determine how the job will run and how many resources it can use.
The Limits section provides configuration options for several settings, including Max trials, which specifies the maximum number of trials to try during the Automated ML job, and Max concurrent trials, which specifies the maximum number of trial jobs that can be executed in parallel.
You can also set the Max nodes setting, which specifies the maximum number of nodes this job can use from the selected compute target, and the Metric score threshold, which sets an iteration metric threshold value; when a trial reaches that threshold, the training job ends.
Configure Job Limits
Configuring job limits is a crucial step in managing your Automated ML job. You can specify the maximum number of trials to try during the job, where each trial has a different combination of algorithm and hyperparameters.
The Max trials setting allows you to set an integer value between 1 and 1,000. This means you can try up to 1,000 different combinations of algorithms and hyperparameters to find the best model.
You can also set the Max concurrent trials setting to control the number of trial jobs that can be executed in parallel. This setting also accepts an integer value between 1 and 1,000.
To further control the job's resources, you can set the Max nodes setting to specify the maximum number of nodes the job can use from the selected compute target.
Here's a summary of the job limit settings:
- Max trials: an integer between 1 and 1,000
- Max concurrent trials: an integer between 1 and 1,000
- Max nodes: the maximum number of nodes the job can use from the selected compute target
- Metric score threshold: the iteration metric value at which the job ends early
- Experiment timeout and early termination: stop the job after a set time, or when the score isn't improving in the short term
These settings help you balance the trade-off between model accuracy and job duration. By setting these limits, you can ensure that your job runs within a reasonable time frame and uses the desired amount of resources.
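In the Python SDK v2 (azure-ai-ml), the same limits can be set on the job object. This is a hedged sketch with illustrative values; the training input and target column are assumptions carried over from the earlier snippet.

```python
# Hedged sketch: configure an AutoML classification job and its limits
# with the Python SDK v2; values are illustrative only.
from azure.ai.ml import automl

classification_job = automl.classification(
    training_data=my_training_data_input,  # defined in the earlier snippet
    target_column_name="y",                # hypothetical target column
    primary_metric="accuracy",
)

classification_job.set_limits(
    timeout_minutes=60,             # experiment timeout
    trial_timeout_minutes=20,       # per-trial timeout
    max_trials=25,                  # maximum number of trials
    max_concurrent_trials=4,        # trials allowed to run in parallel
    enable_early_termination=True,  # stop when the score isn't improving
)
```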
Feature Requirements
To configure your job and get the most out of Azure AutoML, you need to meet some specific requirements.
First and foremost, you'll need an Azure subscription to use Azure AutoML features. This is a must-have, as it allows you to authenticate to Azure ML services.
To set up Azure AutoML, you'll need to create a Service Principal with the following permissions: read/write access to Blob storage, permission to create Endpoints, permission to deploy models, and permission to launch AutoML processes.
You'll also need to configure your Blob storage to be public, as this is where your training dataset and MLTable files will be uploaded.
Additionally, you'll need to create an Azure ML Workspace, which serves as a logical container for your machine-learning experiments, models, and other related resources.
Azure AutoML also requires an Azure ML compute cluster, which is a collection of multiple Azure virtual machines used to train and develop forecast models.
Finally, you'll need to ensure you have enough virtual machine quota to handle inference calls to Endpoints, and, if you're driving Azure AutoML from Board, a Board Cloud license - On-premise Board licenses are not supported.
Here are the specific requirements summarized in a list:
- Service Principal with read/write access to Blob storage, permission to create Endpoints, permission to deploy models, and permission to launch AutoML processes
- Public Blob storage for uploading training dataset and MLTable files
- Azure ML Workspace for machine-learning experiments and models
- Azure ML compute cluster for training and developing forecast models
- Enough virtual machine quota for inference calls to Endpoints
- Board Cloud license for using Azure AutoML features
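As a rough illustration of the Service Principal requirement above, a client application can authenticate to the workspace with the azure-identity library. This is only a sketch; every identifier shown is a placeholder.

```python
# Hedged sketch: authenticate to an Azure ML workspace with a service
# principal; all IDs and names below are placeholders.
from azure.identity import ClientSecretCredential
from azure.ai.ml import MLClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<service-principal-secret>",
)

ml_client = MLClient(
    credential=credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
```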
Running and Validating
To run an Automated ML experiment, you'll need to specify the configuration options in the Validate and test section. You can choose the validation type, which depends on the size of your training data: if it's larger than 20,000 rows, Automated ML applies a train/validation data split, while smaller datasets use cross-validation.
Automated ML applies default validation techniques based on the training data size. For datasets larger than 20,000 rows, the default is to take 10% of the initial training data as the validation set. For smaller datasets, the default number of folds depends on the number of rows: 10 folds for datasets with less than 1,000 rows, and three folds for datasets with 1,000 to 20,000 rows.
If you want to test the models generated by Automated ML, you can provide a test dataset in the Validate and test section. This will trigger a test job at the end of your experiment, and you can view the results in the Azure Machine Learning studio web UI.
Run Experiment
You can run an experiment with Azure Machine Learning using the Python SDK v2 or CLI v2; automated ML jobs submitted this way are currently supported only on an Azure Machine Learning remote compute cluster or compute instance.
To run an experiment, you'll need to create an MLClient in your workspace, which allows you to interact with Azure Machine Learning. This client is the foundation for running automated machine learning jobs.
If you run an experiment with the same configuration settings and primary metric multiple times, you might see variation in each experiment's final metrics score and generated models. This is due to the inherent randomness in the algorithms used by automated machine learning.
You can use the stored run ID to return information about the job, and the --web parameter opens the Azure Machine Learning studio web UI where you can drill into details on the job.
Here are the basic steps to run an experiment:
- Create an MLClient for your workspace and use it to submit the job (see the sketch below).
- If you use the CLI instead, specify the job YAML configuration file, such as ./automl-classification-job.yml.
The configuration parameters for your experiment are set in your task method, and you can also set job training settings and exit criteria with the training and limits settings.
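Here's a hedged sketch of those steps with the Python SDK v2. It assumes a workspace config file is available for MLClient.from_config and reuses the classification_job object from the earlier snippets.

```python
# Hedged sketch: create an MLClient, submit the AutoML job, and use the
# stored run ID to return information about it later.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

returned_job = ml_client.jobs.create_or_update(classification_job)
print(returned_job.name)  # the run ID for this job

# Later, look the job up again by its run ID.
fetched_job = ml_client.jobs.get(returned_job.name)
print(fetched_job.status)
```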
Validate and Test
The Validate and Test section is a crucial part of the Automated ML process, where you get to fine-tune your experiment settings and test the generated models.
Automated ML applies default validation techniques depending on the number of rows provided in the single dataset training_data. If you have a dataset larger than 20,000 rows, it will use a train/validation data split, taking 10% of the initial training data set as the validation set. If your dataset is smaller than 20,000 rows, it will use a cross-validation approach.
You can specify the validation type to use for your training job, and provide a test dataset to evaluate the recommended model.
If you have a dataset with less than 1,000 rows, Automated ML will use 10 folds for cross-validation. For datasets with 1,000 to 20,000 rows, it will use three folds.
Providing a test dataset to evaluate generated models is a preview feature, and can change at any time. This capability is an experimental preview feature, and is not available for certain Automated ML scenarios, such as computer vision tasks or many models and hierarchical time-series forecasting training.
Here's a summary of the validation techniques used by Automated ML:
- More than 20,000 rows: train/validation split, with 10% of the initial training data used as the validation set
- 1,000 to 20,000 rows: three-fold cross-validation
- Fewer than 1,000 rows: ten-fold cross-validation
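If you'd rather control validation yourself, the SDK v2 lets you set it explicitly when the job is configured. This is a hedged sketch; the fold count, target column, and test dataset input are illustrative, and the test dataset option is a preview capability.

```python
# Hedged sketch: override default validation and provide a test dataset
# (preview) when configuring the job; values are illustrative.
classification_job = automl.classification(
    training_data=my_training_data_input,
    target_column_name="y",         # hypothetical target column
    primary_metric="AUC_weighted",
    n_cross_validations=5,          # explicit number of cross-validation folds
    test_data=my_test_data_input,   # hypothetical test dataset input (preview)
)
```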
Predicted vs True
The Predicted vs True chart is a powerful tool for evaluating model performance. It plots the relationship between the target feature and the model's predictions, allowing you to see if a model is biased toward predicting certain values.
The true values are binned along the x-axis, and for each bin, the mean predicted value is plotted with error bars. This helps you visualize the variance of predictions around the mean.
The line displays the average prediction, and the shaded area indicates the variance of predictions. A good model should have a predicted vs true line that is close to the ideal y = x line.
The distance of the trend line from the ideal y = x line is a good measure of model performance on outliers. If the line is far from the ideal line, it may indicate that the model struggles with rare events.
Including more data samples where the distribution is sparse can improve model performance on unseen data. This is especially useful for models that are struggling with outliers.
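To make the binning idea concrete, here's a rough sketch of how such a chart could be assembled with NumPy and Matplotlib. It isn't the studio's implementation, just an illustration of binned true values, mean predictions with error bars, and the ideal y = x reference line.

```python
# Rough sketch (not the studio's code): bin true values, plot the mean
# predicted value per bin with error bars, and overlay the ideal y = x line.
import numpy as np
import matplotlib.pyplot as plt

def predicted_vs_true(y_true, y_pred, n_bins=10):
    bins = np.linspace(y_true.min(), y_true.max(), n_bins + 1)
    centers = (bins[:-1] + bins[1:]) / 2
    means, stds = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_true >= lo) & (y_true < hi)
        means.append(y_pred[mask].mean() if mask.any() else np.nan)
        stds.append(y_pred[mask].std() if mask.any() else np.nan)
    plt.errorbar(centers, means, yerr=stds, fmt="o-", label="mean prediction")
    plt.plot(centers, centers, "--", label="ideal y = x")
    plt.xlabel("True value (binned)")
    plt.ylabel("Predicted value")
    plt.legend()
    plt.show()
```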
Viewing Results
After your automated ML experiment completes, you can view the results in the Azure Machine Learning studio or a Jupyter notebook using the JobDetails Jupyter widget. This allows you to track the history of your jobs.
To view the run history and model evaluation metrics and charts in the studio, sign into the studio and navigate to your workspace. In the left menu, select Jobs, then select your experiment from the list of experiments. Select an automated ML job, and in the Models tab, select the Algorithm name for the model you want to evaluate.
You can also view remote test job results, but this feature is currently in preview and not available for certain Automated ML scenarios. These include computer vision tasks, many models and hierarchical time-series forecasting training, forecasting tasks with deep learning neural networks, and automated ML jobs from local computes or Azure Databricks clusters.
Here are the steps to view the test job metrics of the recommended model:
- In the studio, browse to the Models page, and select the best model.
- Select the Test results (preview) tab.
- Select the job you want, and view the Metrics tab.
You can also view the test predictions used to calculate the test metrics by following these steps:
- At the bottom of the page, select the link under Outputs dataset to open the dataset.
- On the Datasets page, select the Explore tab to view the predictions from the test job.
View Job Results
Viewing job results is a crucial step in evaluating the performance of your automated machine learning experiments. You can find a history of your jobs in the Azure Machine Learning studio.
To view job results, sign into the studio and navigate to your workspace. In the left menu, select Jobs, and then select your experiment from the list of experiments. Next, in the table at the bottom of the page, select an automated ML job to view its details.
You can also view job results in a Jupyter notebook with the JobDetails Jupyter widget instead of navigating through the studio: open your notebook and use the JobDetails widget to display the job details.
Automated ML jobs can be viewed in the studio, but remote test job results are a preview feature that may change at any time. This feature is not available for certain scenarios, including computer vision tasks and many models and hierarchical time-series forecasting training.
Here are the scenarios where remote test job results are not available:
- Computer vision tasks
- Many models and hierarchical time-series forecasting training (preview)
- Forecasting tasks where deep learning neural networks (DNN) are enabled
- Automated ML jobs from local computes or Azure Databricks clusters
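When a job's results are available, you can also pull its metrics programmatically with MLflow instead of the studio UI. The sketch below is an assumption-laden outline: it presumes the azureml-mlflow plugin is installed, reuses the ml_client from earlier, and uses a placeholder job name.

```python
# Hedged sketch: retrieve a completed AutoML job's logged metrics through
# MLflow; assumes the azureml-mlflow plugin is installed.
import mlflow

# Point MLflow at the workspace's tracking store.
workspace = ml_client.workspaces.get(ml_client.workspace_name)
mlflow.set_tracking_uri(workspace.mlflow_tracking_uri)

run = mlflow.get_run(run_id="<automl-job-name>")  # placeholder run ID
print(run.data.metrics)  # evaluation metrics logged for the job
print(run.data.params)   # parameters and settings logged for the job
```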
View Explanations
You can view model explanations while waiting for models to complete. These explanations show which data features influenced a particular model's predictions.
Model explanations can be generated on demand. To generate them, go back to the Models screen, select the Models + child jobs tab, and choose the model you want to explain.
The explainability job takes about 2-5 minutes to complete, and a green success message appears when it's done. After that, you can view the explanations dashboard.
The explanations dashboard is located in the Explanations (preview) tab. Here, you can see which data features influenced the predictions of the selected model.
Model explanations aren't supported for certain algorithms, such as TCNForecaster, AutoArima, and ExponentialSmoothing, but they are supported for models like MaxAbsScaler, LightGBM.
To view the feature importances, select the Aggregate feature importance tab in the Explanations (preview) tab. This chart shows which data features influenced the predictions of the selected model.
Image Classification
Automated ML uses the images from the validation dataset for evaluating the performance of the model.
The performance of the model is measured at an epoch-level to understand how the training progresses. An epoch elapses when an entire dataset is passed forward and backward through the neural network exactly once.
Automated ML provides a classification report that offers class-level values for metrics like precision, recall, f1-score, support, auc, and average_precision.
These metrics are averaged at various levels, including micro, macro, and weighted, giving you a comprehensive view of your model's performance.
Automated ML logs summary metrics like confusion matrix, classification charts, ROC curve, precision-recall curve, and classification report for the model from the best epoch.
This helps you visualize and understand the performance of your model in a more intuitive way.
Classification
Classification metrics are a crucial part of evaluating the performance of a classification model. For image classification, metrics like accuracy, precision, recall, f1-score, support, auc, and average_precision are logged at the epoch level.
Classification report provides class-level values for these metrics with various levels of averaging - micro, macro, and weighted.
For classification multi-class scenarios, metrics like accuracy, average_precision_score_weighted, norm_macro_recall, and precision_score_weighted might not optimize as well for small datasets or those with large class skew. In those cases, AUC_weighted can be a better choice for the primary metric.
Here are some example use cases for different metrics:
- accuracy: image classification, sentiment analysis, churn prediction
- AUC_weighted: datasets that are small or have a large class imbalance
Supported Algorithms
Classification is a fundamental task in machine learning, and having the right algorithms at your disposal can make all the difference. The task method determines the list of algorithms or models to apply.
For classification tasks, you can choose from a variety of algorithms, including Logistic Regression, Light GBM, and Gradient Boosting. These and the other supported algorithms are listed in the Azure documentation's table of supported algorithms per task.
If you're looking for a more traditional approach, you might consider using Decision Tree or Random Forest. Both of these algorithms are also supported for classification tasks.
Some of the classification algorithms you can use, then, are Logistic Regression, Light GBM, Gradient Boosting, Decision Tree, and Random Forest; a sketch of restricting a job to a subset of them follows below.
These are just a few examples of the many algorithms available for classification tasks. By choosing the right algorithm for the job, you can improve the accuracy and effectiveness of your machine learning models.
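In the SDK v2, you can constrain which algorithms AutoML tries through the job's training settings. A hedged sketch follows; the snake_case algorithm names are my assumption of the SDK's naming convention, and the selection itself is illustrative.

```python
# Hedged sketch: limit the algorithms AutoML will try; the algorithm name
# strings below assume the SDK's snake_case convention.
classification_job.set_training(
    allowed_training_algorithms=[
        "logistic_regression",
        "light_gbm",
        "random_forest",
    ],
    enable_onnx_compatible_models=True,  # optional: keep models ONNX-exportable
)
```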
Classification
Classification is a fundamental task in machine learning, and it's essential to understand the metrics and algorithms involved. Classification metrics include accuracy, AUC_weighted, average_precision_score_weighted, norm_macro_recall, and precision_score_weighted, which are useful for various use cases such as image classification, sentiment analysis, and churn prediction.
The choice of metric depends on the specific problem and dataset. For example, accuracy is suitable for image classification, while AUC_weighted is better for datasets with class imbalance or small sizes.
Automated machine learning offers several algorithms for classification tasks, including Logistic Regression, Light GBM, Gradient Boosting, Decision Tree, and Random Forest. Several of these, such as Light GBM, Gradient Boosting, Decision Tree, and Random Forest, are also available for regression and time-series forecasting tasks.
In classification multi-label scenarios, the primary metrics supported are accuracy for text classification and classification metrics defined in the ClassificationMultilabelPrimaryMetrics enum for image classification. Threshold-dependent metrics like accuracy and precision_score_weighted might not optimize well for small datasets or class imbalance, making AUC_weighted a better choice.
Here are some supported algorithms per machine learning task:
- Classification: Logistic Regression, Light GBM, Gradient Boosting, Decision Tree, Random Forest
- Regression: Light GBM, Gradient Boosting, Decision Tree, Random Forest
- Time-series forecasting: AutoArima, Prophet, ExponentialSmoothing, TCNForecaster, Average, Naive, Seasonal Average, Seasonal Naive
Automated ML also offers options for monitoring and evaluating training results, including performance charts and metrics, featurization summaries, and the hyperparameters used when training a model.
Explanations and Feature Importances
Inspecting which dataset features a model uses to make predictions is essential when practicing responsible AI. Model explanations and feature importances are crucial for understanding how a model works.
Automated ML provides a model explanations dashboard to measure and report the relative contributions of dataset features. This dashboard is available for most automated ML experiments.
Model explanations can be generated on demand, and the model explanations dashboard summarizes these explanations. The dashboard is part of the Explanations (preview) tab.
Some automated ML algorithms don't support model interpretability, including TCNForecaster, AutoArima, ExponentialSmoothing, Prophet, Average, Naive, Seasonal Average, and Seasonal Naive.
To view model explanations, go back to the Models screen, select the Models + child jobs tab, and select the first MaxAbsScaler, LightGBM model. Then, select Explain model and follow the prompts to generate the model explanations.
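Rather than generating explanations on demand, you can also request explainability when the job is configured. Here's a minimal sketch using the SDK v2; the target column and metric are placeholders.

```python
# Hedged sketch: ask AutoML to compute explanations for the best model as
# part of the job configuration; values are illustrative.
classification_job = automl.classification(
    training_data=my_training_data_input,
    target_column_name="y",            # hypothetical target column
    primary_metric="accuracy",
    enable_model_explainability=True,  # generate explanations for the best model
)
```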
Data and Featurization
Data and Featurization are crucial steps in Azure Automated ML. You can select the View featurization settings option to see actions to perform on the data in preparation for training.
The Featurization page shows default featurization techniques for your data columns. You can enable/disable automatic featurization and customize the automatic featurization settings for your experiment. To do this, select the Enable featurization option to allow configuration.
Automatic featurization is always enabled when your data contains non-numeric columns. You can configure each available column as desired; the customizations currently available via the studio include whether a column is included in training, the feature type assigned to it, and the value used to impute missing entries.
Data featurization involves transforming data to numbers and vectors of numbers, scaling, and normalizing it to help algorithms that are sensitive to features on different scales. Automated machine learning featurization steps become part of the underlying model.
Data Segments
Data Segments are crucial in machine learning as they help improve model accuracy and prevent overfitting. Training data, validation data, and test data are the three primary segments.
You can specify separate training data and validation data sets, but if you don't, Automated ML will apply default techniques to determine how validation is performed. This determination depends on the number of rows in the dataset.
If your training data has more than 20,000 rows, Automated ML will split the data into training and validation sets, using 10% of the initial training data set as the validation set. This validation set is then used for metrics calculation.
If your dataset is smaller than or equal to 20,000 rows, Automated ML will apply a cross-validation approach. The default number of folds depends on the number of rows, with ten folds used for datasets with fewer than 1,000 rows, and three folds used for datasets with between 1,000 and 20,000 rows.
Here's a summary of the validation techniques used by Automated ML:
- More than 20,000 rows: train/validation split, with 10% of the initial training data used for validation
- 1,000 to 20,000 rows: three-fold cross-validation
- Fewer than 1,000 rows: ten-fold cross-validation
Creating your data as a data asset is also important, as it ensures that your data is formatted appropriately for your experiment. This can be done by creating a new data asset and selecting your dataset from the list.
Data Featurization
Data Featurization is a crucial step in preparing your data for training a machine learning model. It involves transforming your data into a format that algorithms can understand.
You can select the View featurization settings option to see actions to perform on your data in preparation for training. This is where you can enable or disable automatic featurization and customize it for your experiment.
The Featurization page shows default featurization techniques for your data columns. You can change the value type for a selected column by choosing a different feature type. This is useful if your data contains non-numeric columns, as featurization is always enabled in such cases.
Automated machine learning featurization steps, such as feature normalization, handling missing data, and converting text to numeric, become part of the underlying model. The same featurization steps applied during training are applied to your input data automatically when you use the model for predictions.
The following featurization configurations are accepted: automatic featurization ("auto", the default), no featurization ("off"), and customized featurization, where you override settings for individual columns.
The featurization settings don't affect the input data needed for inferencing. If you exclude columns from training, the excluded columns are still required as input for inferencing on the model.
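As a rough illustration of customized featurization in the SDK v2, the sketch below overrides one column's feature type and blocks one transformer. The column name, feature type, and transformer name are placeholders rather than values from this article.

```python
# Hedged sketch: customize featurization; the column name, feature type,
# and blocked transformer below are placeholders.
classification_job.set_featurization(
    mode="custom",
    column_name_and_types={"job_title": "Categorical"},  # override a column's feature type
    blocked_transformers=["LabelEncoder"],                # skip a specific transformer
)
```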
Deployment and Testing
You can deploy the best model as a web service using the automated machine learning interface. Deployment is the integration of the model so it can predict on new data and identify potential areas of opportunity.
To deploy a model, you must register it to the workspace. After you register the model, you can locate it in the studio by selecting Models on the left menu.
The deployment process entails several steps including registering the model, generating resources, and configuring them for the web service. Deployment takes about 20 minutes to complete.
You can initiate the deployment by using one of the following methods: Populate the Deploy model pane or use one-click deployment. The Deploy model pane requires you to enter a unique name for your deployment, a description, and select the type of endpoint you want to deploy.
Here are the fields you need to populate in the Deploy model pane: a unique Name for the deployment, a Description, and the type of endpoint (compute) to deploy to.
After you populate the Deploy model pane, select Deploy. A green success message appears at the top of the Job screen, and in the Model summary pane, a status message appears under Deploy status. You can monitor the deployment progress periodically by selecting Refresh.
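The same deployment can be scripted with the SDK v2 against a managed online endpoint. This is a hedged sketch; the endpoint name, registered model ID, and VM size are placeholders, and it assumes the AutoML model has already been registered in the workspace.

```python
# Hedged sketch: deploy a registered AutoML model to a managed online
# endpoint; names, model ID, and VM size are placeholders.
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

endpoint = ManagedOnlineEndpoint(name="bankmarketing-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model="azureml:my-automl-model:1",  # hypothetical registered model
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```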
Additional Features
In Azure AutoML, you can configure additional settings to fine-tune your experiment and prepare your data for training.
The Additional configuration page shows default values based on your experiment selection and data, and you can choose to use these defaults or configure the settings manually.
You can select the primary metric for scoring your model, which is essential for evaluating its performance.
Ensemble stacking is another feature you can enable to improve machine learning results and predictive performance by combining multiple models.
By selecting the Use all supported models option, you instruct Azure AutoML to try every supported model in the experiment; in that case, you can use the Blocked models dropdown list to exclude specific models from the training job.
If you deselect the Use all supported models option, you instead configure the Allowed models setting to select the models to use for the training job.
The Explain best model option allows you to automatically show explainability on the best model created by Azure AutoML.
To calculate binary metrics, you need to enter the positive class label in the corresponding field.
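Several of these options map to SDK v2 settings. The sketch below is a hedged outline only: the positive class label, metric, and blocked algorithm are illustrative, and positive_label is assumed to be the SDK's name for the binary-metrics class label.

```python
# Hedged sketch: map the Additional configuration options to SDK settings;
# the label, metric, and algorithm names are illustrative.
classification_job = automl.classification(
    training_data=my_training_data_input,
    target_column_name="y",
    primary_metric="AUC_weighted",     # primary metric used to score models
    positive_label="yes",              # positive class label for binary metrics
    enable_model_explainability=True,  # "Explain best model"
)

classification_job.set_training(
    enable_stack_ensemble=True,        # ensemble stacking
    enable_vote_ensemble=True,
    blocked_training_algorithms=["extreme_random_trees"],  # "Blocked models"
)
```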
Frequently Asked Questions
Which AutoML is best?
There is no single "best" AutoML, as each tool has its strengths and weaknesses, such as Cloud AutoML's neural network architecture and Uber Ludwig's minimal code requirement. To determine the best AutoML for your project, consider your specific needs and explore each tool's features and capabilities.
What is the difference between ML and AutoML?
Machine Learning (ML) requires manual steps, while AutoML automates the entire process, minimizing human intervention. This key difference makes AutoML a faster and more efficient way to apply ML to real-world problems.
What is the use of AutoML?
AutoML is ideal for teams seeking productivity gains and simplicity, while custom training is best for high-quality models that require customization.
What is Azure automated ML?
Azure Automated ML is a process that quickly selects the best machine learning algorithm for your specific data. It enables fast generation of machine learning models, streamlining the development process.
Sources
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-understand-automated-ml
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-automated-ml-for-ml-models
- https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-first-experiment-automated-ml
- https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train
- https://www.boardmanual.com/2023/summer/data-modeling/data-model-design-sections/analytics/about-azure-auto-ml.htm