GPU acceleration has become central to modern data center workloads, particularly AI and machine learning. VMware ESXi, a leading hypervisor, paired with NVIDIA's AI Enterprise software stack, lets enterprises deploy and manage GPU-accelerated virtualized environments efficiently. In this guide, we'll walk through installing the NVIDIA AI Enterprise driver on VMware ESXi to enable support for the NVIDIA H100 SXM 80GB HBM3 GPU.
Prerequisites:
- PowerEdge XE9680 server
- VMware ESXi 8.0 Update 2 installed
- Custom ESXi image profile: DEL-ESXi_802.22380479-A04
- NVIDIA H100 SXM5 80GB HBM3 GPU
Great news! NVIDIA AI Enterprise 5.0 now supports the NVIDIA H100 SXM 80GB HBM3, which was previously limited to bare-metal deployments (see the "What's New" section of the release notes: https://docs.nvidia.com/ai-enterprise/latest/release-notes/index.html).
Step 1: Download the Driver Bundle
First, obtain the NVIDIA AI Enterprise 5.0 driver bundle from the NVIDIA Licensing Portal. Ensure it matches your ESXi version and GPU model. In this case, we are using NVIDIA-AI-Enterprise-vSphere-8.0-550.54.16-550.54.15-551.78.zip.
Step 2: Upload the Driver Bundle
Unzip the downloaded bundle and upload the driver component to a shared datastore within your vSphere cluster. The vSphere Client's datastore file browser makes this straightforward.
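If you prefer the command line to the vSphere Client, the same upload can be done over SCP, assuming SSH is already enabled on the host. A minimal sketch; the host name and datastore path below are the examples used later in this walkthrough, so substitute your own:

```shell
# Extract the driver component from the downloaded bundle.
unzip NVIDIA-AI-Enterprise-vSphere-8.0-550.54.16-550.54.15-551.78.zip

# Copy the component zip to a shared datastore on the host.
scp NVD-AIE-800_550.54.16-1OEM.800.1.0.20613240_23471877.zip \
    root@esx03:/vmfs/volumes/nfs-ds-1/nvidia-550/
```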
Step 3: Prepare the Host
- Put the ESXi host into maintenance mode so that no VMs are running during the installation.
- SSH into the ESXi host for command-line access.
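The steps above can also be driven from the ESXi shell itself; a minimal sketch using the standard esxcli maintenance-mode commands:

```shell
# Enter maintenance mode (evacuate or power off VMs first,
# or let DRS migrate them in a fully automated cluster).
esxcli system maintenanceMode set --enable true

# Confirm the host is now in maintenance mode.
esxcli system maintenanceMode get
```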
Step 4: Install the NVIDIA Enterprise AI Driver
Execute the following command to install the NVIDIA Enterprise AI driver on the ESXi host:
esxcli software vib install -d /vmfs/volumes/<datastore>/path/to/driver_bundle.zip
Replace <datastore> and path/to/driver_bundle.zip with the appropriate datastore name and path. A successful installation prints a confirmation message like the one below.
[root@esx03:~] esxcli software vib install -d /vmfs/volumes/nfs-ds-1/nvidia-550/NVD-AIE-800_550.54.16-1OEM.800.1.0.20613240_23471877.zip
Installation Result
   Message: Operation finished successfully.
   VIBs Installed: NVD_bootbank_NVD-AIE_ESXi_8.0.0_Driver_550.54.16-1OEM.800.1.0.20613240
   VIBs Removed:
   VIBs Skipped:
   Reboot Required: false
   DPU Results:
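If you want to see what the transaction would do before committing it, esxcli software vib install accepts a --dry-run flag, and esxcli software vib list shows what ended up installed. A short sketch, reusing the datastore path from this walkthrough:

```shell
# Preview the install transaction without modifying the host.
esxcli software vib install --dry-run \
    -d /vmfs/volumes/nfs-ds-1/nvidia-550/NVD-AIE-800_550.54.16-1OEM.800.1.0.20613240_23471877.zip

# After the real install, confirm the NVIDIA VIB is present.
esxcli software vib list | grep -i NVD-AIE
```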
Step 5: Reboot the Server
Reboot the ESXi host to finalize the driver installation process.
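Although the install output reported "Reboot Required: false", rebooting ensures the kernel module loads cleanly. One way to do this from the shell (the reason string is free-form):

```shell
# Reboot the host (it is already in maintenance mode).
esxcli system shutdown reboot --reason "NVIDIA AI Enterprise driver install"

# After the host comes back up, take it out of maintenance mode.
esxcli system maintenanceMode set --enable false
```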
Step 6: Verify Installation
Upon reboot, ensure that the NVIDIA vGPU software package is correctly installed and loaded. Check for the NVIDIA kernel driver by running the following command:
[root@esx03:~] vmkload_mod -l | grep nvidia
nvidia 58420
You should see the NVIDIA kernel driver listed among the loaded modules.
Additionally, confirm that the NVIDIA kernel driver can communicate with the physical GPUs by running nvidia-smi:
[root@esx03:~] nvidia-smi
Tue Mar 26 14:47:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.16 Driver Version: 550.54.16 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 35C P0 77W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 77W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:4C:00.0 Off | 0 |
| N/A 30C P0 76W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:5D:00.0 Off | 0 |
| N/A 34C P0 75W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9B:00.0 Off | 0 |
| N/A 34C P0 78W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:BB:00.0 Off | 0 |
| N/A 31C P0 77W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 33C P0 78W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:DB:00.0 Off | 0 |
| N/A 31C P0 78W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
This command should display information about the installed NVIDIA GPUs, confirming proper functionality.
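For scripted checks, nvidia-smi's machine-readable query mode (--query-gpu with --format=csv,noheader) is easier to parse than the default table. A minimal sketch, with a sample string standing in for live output so the counting logic is visible:

```shell
# On a live host, capture the CSV listing directly:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
# Here 'sample' stands in for that output (two GPUs shown for brevity).
sample='0, NVIDIA H100 80GB HBM3, 81559 MiB
1, NVIDIA H100 80GB HBM3, 81559 MiB'

# One CSV line per GPU, so counting lines counts GPUs.
gpu_count=$(printf '%s\n' "$sample" | wc -l)
echo "Detected ${gpu_count} GPUs"
```

On the XE9680 in this walkthrough, the live command should report eight H100s, matching the table above.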
Step 7: Repeat for Cluster Hosts
Repeat the steps above on every host in your vSphere cluster to ensure consistent GPU support across the environment.
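With SSH enabled on each host, the per-host sequence can be scripted from a management workstation. A rough sketch; the host names are hypothetical placeholders, and in practice you would wait for each host to finish rebooting before moving on:

```shell
# Hypothetical host list -- replace with your cluster's hosts.
for host in esx01 esx02 esx03; do
  ssh "root@${host}" \
    "esxcli system maintenanceMode set --enable true && \
     esxcli software vib install -d /vmfs/volumes/nfs-ds-1/nvidia-550/NVD-AIE-800_550.54.16-1OEM.800.1.0.20613240_23471877.zip && \
     esxcli system shutdown reboot --reason 'NVIDIA driver install'"
done
```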