With Red Hat OpenShift, you can share GPU capacity across AI workloads, which can cut costs and improve performance. OpenShift can manage multiple GPUs within a single node and divide their capacity between workloads, making more efficient use of GPU resources. Because fewer nodes are needed to run the same AI workloads, shared GPU support lowers costs, improves scalability, and helps organizations reach insights faster.
Installation
To install the NVIDIA GPU Operator, follow the guidance in Installing the NVIDIA GPU Operator on OpenShift. The operator is a software component that manages and automates the deployment of NVIDIA GPU software on your OpenShift cluster, letting your AI workloads use the cluster's GPUs.
Configuration
In OpenShift Virtualization, the NVIDIA GPU Operator provisions worker nodes for running GPU-accelerated virtual machines, letting your AI workloads take advantage of NVIDIA GPUs.
After installing the Operator through the OpenShift Container Platform web console, select it from the Installed Operators list and click it to access its configuration options.
To create a cluster policy instance, select the ClusterPolicy tab, click Create ClusterPolicy, and keep the default name gpu-cluster-policy. Then expand the NVIDIA GPU/vGPU Driver config drop-down and specify the licensing config map name, repository path, image name, driver version, and imagePullSecret.
Here are the specific configuration options you need to set:
- Licensing config map name: enter the name of the licensing config map created in the previous section
- NLS Enabled: check the nlsEnabled checkbox
- RDMA: check the enabled checkbox if you want to deploy GPUDirect RDMA
- Repository path: specify the repository path under the NVIDIA GPU/vGPU Driver config section
- Image name: specify the NVIDIA vGPU driver version under the NVIDIA GPU/vGPU Driver config section
- ImagePullSecret: specify the imagePullSecret created in the section "Create NGC secret"
After configuring these options, click on Create to proceed with the installation. The GPU Operator will then install all the required components to set up the NVIDIA GPUs in the OpenShift Container Platform cluster.
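In YAML terms, the options above land in the driver section of the ClusterPolicy. The following is a minimal sketch; the repository, image, version, and secret values are illustrative placeholders, not values from this guide:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    licensingConfig:
      configMapName: licensing-config   # licensing config map from the previous section
      nlsEnabled: true                  # the "NLS Enabled" checkbox
    rdma:
      enabled: true                     # only if deploying GPUDirect RDMA
    repository: nvcr.io/nvaie           # placeholder repository path
    image: vgpu-guest-driver            # placeholder image name
    version: "525.85.05"                # placeholder NVIDIA vGPU driver version
    imagePullSecrets:
      - ngc-secret                      # placeholder; secret from "Create NGC secret"
```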
The installation can take some time; allow 10-20 minutes before verifying the ClusterPolicy installation. Once it completes, you can verify the installation from the CLI by listing each node and its number of GPUs.
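One way to list each node and its GPU count from the CLI, assuming the standard nvidia.com/gpu resource name advertised by the GPU Operator:

```shell
# Show each node name alongside its reported NVIDIA GPU capacity
oc get nodes -o=custom-columns='NODE:.metadata.name,GPUs:.status.capacity.nvidia\.com/gpu'
```

Nodes without GPUs show `<none>` in the GPUs column. This command requires access to a running cluster.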
Container Platform
OpenShift Container Platform on bare metal or VMware vSphere with GPU Passthrough doesn't require changes to the ClusterPolicy. You can follow the guidance in Installing the NVIDIA GPU Operator to install the NVIDIA GPU Operator.
The option to use the vGPU driver is available for bare metal and VMware vSphere VMs with GPU Passthrough. You can follow the guidance in the section "OpenShift Container Platform on VMware vSphere with NVIDIA vGPU" for more information.
To verify the ClusterPolicy installation, run the command `oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'` from the CLI. This prints the status of the ClusterPolicy, which changes to ready when the installation succeeds.
Using Virtual Machines
With OpenShift Virtualization, you can run virtual machines (VMs) alongside containers on the same cluster, so GPU-accelerated applications that need a full VM can still share cluster resources.
To use virtual GPUs with VMs, set vgpuManager to true in the ClusterPolicy manifest, and set disableMDevConfiguration to false in the HyperConverged custom resource (CR).
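As a sketch, the two settings map to manifests like these; the namespace and resource names follow common defaults and may differ in your cluster:

```yaml
# ClusterPolicy: enable the vGPU Manager
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuManager:
    enabled: true
---
# HyperConverged CR: keep mediated-device configuration enabled
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  featureGates:
    disableMDevConfiguration: false
```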
You can configure mediated devices using two methods: the NVIDIA method, which only uses the GPU Operator, or the OpenShift Virtualization method, which uses the OpenShift Virtualization features to schedule mediated devices.
Here are the key differences between the two methods:
- NVIDIA method: the GPU Operator alone creates, manages, and exposes the mediated devices.
- OpenShift Virtualization method: the GPU Operator installs the vGPU Manager, while OpenShift Virtualization creates the mediated devices and schedules them.
To use the OpenShift Virtualization method, set disableMDevConfiguration to false in the HyperConverged CR and vgpuManager to true in the ClusterPolicy manifest.
You can create mediated devices and expose them to the cluster by editing the HyperConverged CR. This involves adding the mediatedDeviceTypes and nodeMediatedDeviceTypes configurations, as well as the mdevNameSelector and resourceName values.
To expose mediated devices to the cluster, you need to add the following values to the HyperConverged CR:
- mdevNameSelector: This is the name of the mediated device
- resourceName: This is the resource name allocated on the node
For example, to expose a mediated device with the name GRID T4-2Q, you would add the following values to the HyperConverged CR:
mdevNameSelector: GRID T4-2Q
resourceName: nvidia.com/GRID_T4-2Q
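Putting these pieces together, the edited HyperConverged CR might look like the following sketch; the nvidia-231 mediated device type and the node selector value are illustrative assumptions:

```yaml
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  mediatedDevicesConfiguration:
    mediatedDeviceTypes:
      - nvidia-231                       # example mdev type for GRID T4-2Q
    nodeMediatedDeviceTypes:
      - mediatedDeviceTypes:
          - nvidia-231
        nodeSelector:
          kubernetes.io/hostname: example-node   # hypothetical node name
  permittedHostDevices:
    mediatedDevices:
      - mdevNameSelector: GRID T4-2Q     # name of the mediated device
        resourceName: nvidia.com/GRID_T4-2Q  # resource name allocated on the node
```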
Container Platform on Bare Metal
You can deploy OpenShift Container Platform on bare metal without making changes to the ClusterPolicy. This is a great option if you want to leverage the power of bare metal.
For this setup, you'll need to follow the guidance in Installing the NVIDIA GPU Operator, which will help you install the NVIDIA GPU Operator. This is because bare metal deployments can benefit from GPU acceleration.
If you want to use the vGPU driver with your bare metal setup, you'll need to follow the guidance in the section “OpenShift Container Platform on VMware vSphere with NVIDIA vGPU”. This will allow you to take full advantage of your GPU resources.
OpenShift Container Platform on VMware vSphere with NVIDIA vGPU
Installing the NVIDIA vGPU Host Driver VIB on the ESXi host is out of the scope of this document and requires following the NVIDIA AI Enterprise Deployment Guide.
To install OpenShift on vSphere, follow the steps outlined in the Installing on vSphere section of the Red Hat OpenShift documentation.
To create the Cluster Policy Instance, you need to follow these steps:
- In the OpenShift Container Platform web console, select Operators > Installed Operators, and click NVIDIA GPU Operator.
- Select the ClusterPolicy tab, then click Create ClusterPolicy.
- Expand the drop-down for NVIDIA GPU/vGPU Driver config, then licensingConfig, and enter the name of the licensing config map.
- Expand the rdma menu and check enabled if you want to deploy GPUDirect RDMA.
- Specify the repository path, image name, and NVIDIA vGPU driver version under the NVIDIA GPU/vGPU Driver config section.
- Expand the Advanced configuration menu and specify the imagePullSecret.
- Click Create.
The GPU Operator proceeds to install all the required components to set up the NVIDIA GPUs in the OpenShift Container Platform cluster.
Procedure
To create the licensing ConfigMap, work in the nvidia-gpu-operator project and supply the ConfigMap's name and data. In this guide, the ConfigMap is named licensing-config and holds a client configuration token and a grid (gridd.conf) configuration file.
Here's a step-by-step guide to creating a ConfigMap:
- Navigate to Home > Projects and select the nvidia-gpu-operator.
- Select the Workloads Drop Down menu.
- Select ConfigMaps.
- Click Create ConfigMap.
- Enter the details for your ConfigMap.
- Click Create.
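The steps above might produce a ConfigMap like the following sketch; the token value is a placeholder, and the gridd.conf contents depend on your licensing setup:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: licensing-config
  namespace: nvidia-gpu-operator
data:
  # Placeholder: paste the token downloaded from the NVIDIA licensing portal
  client_configuration_token.tok: "<client configuration token>"
  gridd.conf: |
    # Configure the driver as a licensed vGPU client (value depends on your license)
    FeatureType=1
```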
Conclusion
By following the steps outlined in this guide, you should now have a functional OpenShift cluster configured for GPU-accelerated AI workloads. A well-configured cluster is the foundation of any successful AI project, and this setup prepares you to run demanding AI workloads efficiently.
Infrastructure
Setting up a shared GPU with Red Hat OpenShift AI requires infrastructure that can handle the demands of AI workloads. VMware vSphere is a popular virtualization choice and works with NVIDIA vGPUs, and Red Hat OpenShift AI is designed to work with NVIDIA vGPUs for workloads that need high-performance computing. Note that installing the NVIDIA vGPU Host Driver VIB on the ESXi host is outside the scope of this document; the NVIDIA AI Enterprise Deployment Guide provides detailed instructions for that step.
Red Hat Features
Red Hat OpenShift Container Platform supports the NVIDIA GPU Operator, which enables autoscaling of NVIDIA GPUs.
This means you can dynamically scale your GPU resources as needed, without manual intervention.
Autoscaling is particularly useful for AI workloads that require variable amounts of GPU power.
With Red Hat OpenShift, you can easily integrate NVIDIA GPUs into your containerized applications.
Frequently Asked Questions
What is the difference between OpenShift and OpenShift AI?
OpenShift is a container application platform, while OpenShift AI is a specialized platform for artificial intelligence and machine learning, built on top of OpenShift. OpenShift AI adds AI/ML tools to the OpenShift ecosystem, making it a more comprehensive solution for AI development and deployment.
What is an AI GPU?
A GPU, or Graphics Processing Unit, is a specialized processor whose massively parallel architecture accelerates AI systems. It handles the matrix and vector math behind training and inference, making complex tasks faster and more efficient.
Sources
- https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/nvaie-with-ocp.html
- https://docs.openshift.com/container-platform/4.14/virt/virtual_machines/advanced_vm_management/virt-configuring-virtual-gpus.html
- https://docs.nvidia.com/datacenter/cloud-native/openshift/23.6.1/nvaie-with-ocp.html
- https://medium.com/@tcij1013/deploying-gpus-for-ai-workloads-on-openshift-on-aws-9f43b9ce2875
- https://www.thefastmode.com/technology-solutions/38183-red-hat-unveils-openshift-ai-2-15-with-advanced-ai-ml-features-for-hybrid-cloud-deployment