Azure CycleCloud is a tool for deploying and managing high-performance computing (HPC) clusters on Azure.
It provides a scalable and secure environment for running HPC workloads, allowing users to focus on their research and applications rather than the underlying infrastructure.
With CycleCloud, users can create and manage clusters in minutes, and scale them up or down as needed to match changing workloads.
This flexibility and ease of use make it an ideal solution for researchers and scientists who need to run complex simulations and analyses.
Configuration and Setup
To configure Azure CycleCloud, follow the setup instructions in the GUI. The trickiest part is supplying the details of a service principal, which makes an Azure subscription available to CycleCloud as a cloud provider.
To set up a cluster, create an LSF cluster in the CycleCloud UI, start it, restart mbatchd on the master node, and then submit a job that requests resources defined in cyclecloudprov_templates.json. A fully-managed LSF cluster type also needs the LSF installers copied into the blobs/ directory and the LSF binaries uploaded to the CycleCloud locker, for example by using pogo sync.
To enable custom shared resource types, add the corresponding properties to lsb.shared. Inspecting cyclecloudprov_templates.json and user_data.sh shows how these properties are used.
Configuration
To customize Azure CycleCloud, there are two tutorials you can follow. They walk through setting up and configuring a CycleCloud environment.
Here are the steps to set up a fully-managed LSF cluster type:
1. Copy LSF installers into the blobs/ directory.
2. Upload the LSF binaries to the CycleCloud locker.
3. Import the cluster as a service offering using the command `cyclecloud import_cluster LSF-full -c lsf -f lsf-full.txt -t`.
4. Add the cluster to your managed cluster list in the CycleCloud UI.
5. Follow the configuration menu, save the cluster, and start it.
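Step 3 can also be scripted. The sketch below only assembles the `cyclecloud import_cluster` command quoted in that step and, by default, prints it instead of running it. The function names, parameter names, and the `dry_run` switch are hypothetical; the arguments themselves are copied verbatim from the step.

```python
import shlex
import subprocess


def build_import_command(cluster_name="LSF-full", template_cluster="lsf",
                         template_file="lsf-full.txt"):
    """Return the argv for the import command quoted in step 3."""
    # Flags are copied as-is from the step; nothing is added or reinterpreted.
    return [
        "cyclecloud", "import_cluster", cluster_name,
        "-c", template_cluster,
        "-f", template_file,
        "-t",
    ]


def import_cluster(dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    argv = build_import_command()
    print(shlex.join(argv))
    if not dry_run:
        # Assumes the cyclecloud CLI is installed and already initialized.
        subprocess.run(argv, check=True)
    return argv


import_cluster()  # dry run: prints the command without executing it
```

Keeping the dry run as the default makes it easy to review the exact command before it touches a real CycleCloud installation.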
Some important notes to keep in mind when setting up a fully-managed LSF cluster:
- Transient software installation failures, followed by automatic recovery, are expected during HA master setup. They are a side effect of avoiding race conditions.
- cyclecloudprov_templates.json is not automatically updated. If you change the machine type, update the host attributes (mem, ncpus, etc.) yourself and then restart mbatchd.
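The manual update of host attributes can be sketched as a small helper. This assumes the file has the layout `{"templates": [{"templateId": ..., "attributes": {...}}]}` with each attribute stored as a `[type, value]` pair of strings; check your own cyclecloudprov_templates.json before relying on it. The function name, the `ondemand` template ID, and the sample values are made up for illustration.

```python
import json


def update_host_attributes(templates_doc, template_id, ncpus, mem_mb):
    """Return a copy of the document with ncpus and mem updated for one template.

    Assumed layout: {"templates": [{"templateId": ..., "attributes": {...}}]},
    where every attribute is a [type, value] pair of strings.
    """
    updated = json.loads(json.dumps(templates_doc))  # simple deep copy
    for template in updated["templates"]:
        if template["templateId"] == template_id:
            template["attributes"]["ncpus"] = ["Numeric", str(ncpus)]
            template["attributes"]["mem"] = ["Numeric", str(mem_mb)]
            return updated
    raise KeyError(f"no template with templateId {template_id!r}")


# Hypothetical sample; real files contain more fields per template.
sample = {
    "templates": [
        {
            "templateId": "ondemand",
            "attributes": {"ncpus": ["Numeric", "2"], "mem": ["Numeric", "8192"]},
        }
    ]
}

updated = update_host_attributes(sample, "ondemand", ncpus=4, mem_mb=16384)
print(json.dumps(updated, indent=2))
```

After the changed document is written back to disk, mbatchd still has to be restarted for LSF to pick it up.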
Resource Group
To set up a resource group for CycleCloud, first run `az account list-locations` to find a valid location code for your subscription. Not all services are available in all locations, so the location code you choose determines which services are available to you.
Create a resource group to organize your resources for CycleCloud. This will make it easier to manage and scale your resources as needed.
You can create a resource group using the Azure CLI or the Azure portal.
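As a rough sketch of the Azure CLI route, the helper below parses JSON output from `az account list-locations -o json`, checks a chosen location code against it, and builds the matching `az group create` command. The function names, the resource group name `cyclecloud-rg`, and the trimmed sample output are assumptions for illustration; the code only builds the command and does not call Azure.

```python
import json


def valid_location_codes(list_locations_json):
    """Extract location codes from `az account list-locations -o json` output."""
    return sorted(entry["name"] for entry in json.loads(list_locations_json))


def build_group_create_command(group_name, location, known_locations):
    """Return argv for `az group create`, rejecting unknown location codes."""
    if location not in known_locations:
        raise ValueError(f"{location!r} is not a location code for this subscription")
    return ["az", "group", "create", "--name", group_name, "--location", location]


# Trimmed, made-up sample of the CLI output; real output has more fields.
sample_output = (
    '[{"name": "westeurope", "displayName": "West Europe"},'
    ' {"name": "eastus", "displayName": "East US"}]'
)

locations = valid_location_codes(sample_output)
command = build_group_create_command("cyclecloud-rg", "westeurope", locations)
print(" ".join(command))
```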
A Deployed Environment
A CycleCloud HPC system can be deployed on Azure infrastructure, with CycleCloud itself installed as an application server on a VM in Azure.
The makeup of the HPC system is defined entirely through CycleCloud templates, which can include components like HPC scheduler head nodes, compute nodes, login nodes, bastion hosts, and other supporting infrastructure.
For production, CycleCloud requires three subnets (cycle, compute, and user), which are needed to create HPC clusters in the GUI. In a non-production environment, one subnet is sufficient.
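To make the subnet requirement concrete, here is a minimal planning sketch using Python's standard `ipaddress` module. The subnet names cycle, compute, and user come from the text; the address range, the /24 size, the `default` name for the single non-production subnet, and the function name are all assumptions chosen for illustration.

```python
import ipaddress


def plan_subnets(vnet_cidr, production=True, new_prefix=24):
    """Split a virtual network range into named subnets.

    Production gets three subnets (cycle, compute, user); otherwise one.
    """
    network = ipaddress.ip_network(vnet_cidr)
    names = ["cycle", "compute", "user"] if production else ["default"]
    subnets = network.subnets(new_prefix=new_prefix)
    # Hand out consecutive blocks in the order the names are listed.
    return {name: str(next(subnets)) for name in names}


print(plan_subnets("10.0.0.0/16"))
print(plan_subnets("10.0.0.0/16", production=False))
```

In practice the compute subnet usually needs far more addresses than the other two, so equal-sized blocks are only a starting point.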
A typical deployed environment includes these mandatory resources:
- Virtual Machine for running CycleCloud
- Shared filesystem for users' home directories
- Storage account for CycleCloud projects storage
Optional components include an Azure Managed Lustre Filesystem in its own subnet, and a Bastion host for secured connectivity to the CycleCloud web portal and SSH access to the login nodes.
A Bastion host is required in some environments, especially where public IPs are not allowed. In that case a Virtual Network Gateway or Azure Bastion is used in a hub-and-spoke pattern.
Environment Variables for user_data.sh
CycleCloud/LSF automatically sets the following variables in the run environment of user_data.sh:
- rc_account
- template_id
- providerName (defaults to cyclecloud if not specified)
- clustername
- cyclecloud_nodeid
Any attributes specified in the userData template are also available in user_data.sh.
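user_data.sh itself is a shell script, but the lookup logic is easy to show in Python, for instance in a helper that the script calls. The sketch below reads the variables named in this section and applies the documented `cyclecloud` default for providerName. The function name and the sample values are hypothetical.

```python
import os

# Variable names are taken from this section; everything else is illustrative.
EXPECTED_VARS = ("rc_account", "template_id", "providerName",
                 "clustername", "cyclecloud_nodeid")


def read_lsf_environment(environ=None):
    """Collect the variables CycleCloud/LSF sets, defaulting providerName."""
    environ = os.environ if environ is None else environ
    values = {name: environ.get(name) for name in EXPECTED_VARS}
    if not values["providerName"]:
        # The section states that providerName defaults to cyclecloud.
        values["providerName"] = "cyclecloud"
    return values


# Made-up values standing in for what a real node would see.
example = read_lsf_environment({"rc_account": "default", "template_id": "ondemand"})
print(example)
```

Passing a plain dictionary instead of the real environment keeps the helper easy to exercise outside a cluster node.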
Components and Features
CycleCloud offers a range of features to manage your compute resources. You can manage virtual machines and scale sets to provide a flexible set of compute resources that can meet your dynamic workload requirements.
CycleCloud's scheduler-agnostic design allows you to use standard HPC schedulers like Slurm, PBS Pro, LSF, Grid Engine, and HTCondor, or extend CycleCloud autoscaling plugins to work with your own scheduler.
Capabilities
CycleCloud can automatically adjust cluster size and components based on job load, availability, and time requirements, so compute resources scale up or down with dynamic workloads.
HPC
Once deployed, CycleCloud is reachable through an FQDN (e.g. cyclecloud.westeurope.azurecontainer.io) that maps to an external IP.
CycleCloud supports four native HPC schedulers: Slurm, PBS, HTCondor, and Grid Engine. This makes it easy to integrate with existing workflows and tools.
The fully-managed LSF cluster is a completely automated cluster that sets up a filesystem for LSF_TOP, high-availability LSF master nodes, all the LSF configuration files, and worker nodes. Its cluster template is called lsf-full.txt.
CycleCloud deploys this cluster to Azure once the template has been imported as a service offering using the CLI.
CycleCloud supports a number of compute scenarios, including tightly-coupled MPI jobs, high-throughput parallel tasks, GPU-accelerated workloads, and low-priority virtual machines. These scenarios are enabled by configuring custom shared resource types.
Container Instance
CycleCloud is packaged as an RPM, a DEB, or a container image.
The container can be installed on Azure Container Instances.
Currently, the container does not support Kubernetes, which means it can't be run on AKS.
Placement Groups
Placement Groups describe the extents of the InfiniBand networks in Azure datacenters for HPC scenarios. These networks have limited span and require special handling in LSF and CycleCloud.
In Azure, InfiniBand-enabled VMs that reside in the same placement group share an InfiniBand network. The placement group is determined by the placementGroupName in the LSF template for CycleCloud in cyclecloudprov_templates.json.
The placementGroupName in this file is intentional and necessary: it matches the host attribute placementgroup, which ensures that nodes borrowed from CycleCloud reside in the same placement group.
The placement_group_id is set in userData so that user_data.sh can use it at host start time.
The ondemandmpi attribute prevents jobs from matching on hosts where placementGroup is undefined.
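The matching idea can be illustrated with a toy model. This is not how LSF evaluates resource requirements; it only shows the intent described above: hosts are grouped by their placementgroup attribute, and a host counts as an MPI candidate only when ondemandmpi is set and a placement group is defined. The host records, names, and function names are invented for the example.

```python
from collections import defaultdict


def group_by_placement_group(hosts):
    """Map placement group name to host names; hosts without a group are skipped."""
    groups = defaultdict(list)
    for host in hosts:
        placement_group = host.get("placementgroup")
        if placement_group:
            groups[placement_group].append(host["name"])
    return dict(groups)


def eligible_for_mpi(host):
    """Toy rule: ondemandmpi must be set and a placement group must be defined."""
    return bool(host.get("ondemandmpi")) and bool(host.get("placementgroup"))


# Invented host records for illustration only.
hosts = [
    {"name": "node-1", "placementgroup": "pg-a", "ondemandmpi": True},
    {"name": "node-2", "placementgroup": "pg-a", "ondemandmpi": True},
    {"name": "node-3", "placementgroup": None, "ondemandmpi": False},
]

print(group_by_placement_group(hosts))
print([host["name"] for host in hosts if eligible_for_mpi(host)])
```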
Frequently Asked Questions
What is the difference between Azure Batch and CycleCloud?
Azure Batch is geared towards developers building a capability into their own product or service, with its own scheduler, while CycleCloud is designed for traditional Linux HPC admins, offering industry-standard schedulers like Slurm or LSF. This difference in focus makes them suitable for distinct use cases.
How do I access CycleCloud?
To access CycleCloud, navigate to the Monitoring / Cycle Cloud menu. From there, you can view the status of the nodes in your cluster through the CycleCloud portal.
Sources
- https://tsi-ccdoc.readthedocs.io/en/master/Tech-tips/HPC-with-Azure-CycleCloud.html
- https://learn.microsoft.com/en-us/azure/cyclecloud/overview
- https://learn.microsoft.com/en-us/azure/cyclecloud/qs-install-marketplace
- https://learn.microsoft.com/en-us/azure/cyclecloud/overview-ccws
- https://github.com/Azure/cyclecloud-lsf