How to Install the NVIDIA AI Enterprise Driver on VMware ESXi

GPU acceleration has become central to AI and machine learning workloads in the modern data center. VMware ESXi, paired with the NVIDIA AI Enterprise software stack, lets enterprises deploy and manage GPU-accelerated virtualized environments. This guide walks through installing the NVIDIA AI Enterprise host driver on VMware ESXi to enable support for the NVIDIA H100 SXM 80GB HBM3 GPU.

Prerequisites:

  • Dell PowerEdge XE9680 server
  • VMware ESXi 8.0 Update 2 installed
  • Custom ESXi image profile: DEL-ESXi_802.22380479-A04
  • NVIDIA H100 SXM5 80GB HBM3 GPU

Great news: NVIDIA AI Enterprise 5.0 now supports the NVIDIA H100 SXM 80GB HBM3 in virtualized environments. This GPU was previously limited to bare-metal deployments (details in https://docs.nvidia.com/ai-enterprise/latest/release-notes/index.html#whats-new:~:text=Newly%20supported%20graphics,Grace%20Hopper%20Superchip).

Step 1: Download the Driver Bundle

First, download the NVIDIA AI Enterprise 5.0 driver bundle from the NVIDIA Licensing Portal. Make sure it matches your ESXi version and GPU model. This guide uses NVIDIA-AI-Enterprise-vSphere-8.0-550.54.16-550.54.15-551.78.zip.

Step 2: Upload the Driver Bundle

Unzip the downloaded driver bundle and upload the host driver package it contains (the NVD-AIE-*.zip file used in Step 4) to a shared datastore within your vSphere cluster. The vSphere Client's datastore file browser is a convenient way to upload it.
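
As an alternative to the file browser, the upload can be sketched from the command line on the workstation that holds the extracted bundle. This is a minimal sketch, not a prescribed method: find_host_driver_zip is a hypothetical helper, the NVD-AIE-*.zip pattern is taken from the file name shown in Step 4, and the scp line assumes SSH is already enabled on the host and that the target directory exists.

```shell
# Hypothetical helper: print the path of the host driver package
# found under the directory where the bundle was extracted ($1).
find_host_driver_zip() {
    find "$1" -type f -name 'NVD-AIE-*.zip'
}

# Example usage (host name and datastore path mirror the Step 4 sample;
# adjust them for your environment):
#   zip=$(find_host_driver_zip ./NVIDIA-AI-Enterprise-vSphere-8.0-550.54.16-550.54.15-551.78)
#   scp "$zip" root@esx03:/vmfs/volumes/nfs-ds-1/nvidia-550/
```

If the search prints more than one path, pick the package that matches your ESXi release before copying.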

Step 3: Prepare the Host

  1. Put the ESXi host into maintenance mode so that no virtual machines are running on it during the installation.
  2. SSH into the ESXi host for command-line access.
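
Maintenance mode can also be set from the SSH session itself. The sketch below assumes the virtual machines on the host have already been migrated or powered off, since the host cannot finish entering maintenance mode while they are running. enter_maintenance_mode is a hypothetical helper name; the esxcli commands it wraps are the standard maintenance mode commands.

```shell
# Hypothetical helper: enable maintenance mode and confirm the new state.
enter_maintenance_mode() {
    esxcli system maintenanceMode set --enable true || return 1
    # "get" prints Enabled or Disabled.
    [ "$(esxcli system maintenanceMode get)" = "Enabled" ]
}

# Example usage on the ESXi host:
#   enter_maintenance_mode && echo "host is in maintenance mode"
```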

Step 4: Install the NVIDIA AI Enterprise Driver

Run the following command to install the NVIDIA AI Enterprise driver on the ESXi host:

esxcli software vib install -d /vmfs/volumes/<datastore>/path/to/driver_bundle.zip

Replace <datastore> and path/to/driver_bundle.zip with the appropriate datastore name and path. After installation, you should see a message confirming that the operation finished successfully, as in the sample output below.

[root@esx03:~]  esxcli software vib install -d /vmfs/volumes/nfs-ds-1/nvidia-550/NVD-AIE-800_550.54.16-1OEM.800.1.0.20613240_23471877.zip
Installation Result
   Message: Operation finished successfully.
   VIBs Installed: NVD_bootbank_NVD-AIE_ESXi_8.0.0_Driver_550.54.16-1OEM.800.1.0.20613240
   VIBs Removed:
   VIBs Skipped:
   Reboot Required: false
   DPU Results:
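
The install and its confirmation can be wrapped in one small function. This is a sketch under stated assumptions: install_nvidia_bundle is a hypothetical helper, the success string is the one shown in the sample output above, and the NVD-AIE pattern is taken from the VIB name in that same output.

```shell
# Hypothetical helper: install an offline bundle and confirm the result.
# $1 is the full path to the bundle on the datastore.
install_nvidia_bundle() {
    bundle="$1"
    # Capture the installer output so it can be both shown and checked.
    out=$(esxcli software vib install -d "$bundle")
    status=$?
    echo "$out"
    [ "$status" -eq 0 ] || return 1
    # The success message matches the sample output above.
    echo "$out" | grep -q 'Operation finished successfully' || return 1
    # Show the installed NVIDIA VIB; grep fails if none is listed.
    esxcli software vib list | grep -i 'NVD-AIE'
}

# Example usage on the ESXi host:
#   install_nvidia_bundle /vmfs/volumes/<datastore>/path/to/driver_bundle.zip
```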

Step 5: Reboot the Server

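Reboot the ESXi host to finalize the driver installation. Although the installer output above reports Reboot Required: false, rebooting ensures the NVIDIA kernel module is loaded before you verify the installation.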
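
The reboot can be issued from the same SSH session. The sketch below uses a hypothetical helper, reboot_if_in_maintenance, which refuses to reboot unless the host reports that it is in maintenance mode; the reason text is only an example.

```shell
# Hypothetical helper: reboot only when the host is in maintenance mode.
reboot_if_in_maintenance() {
    state=$(esxcli system maintenanceMode get)
    if [ "$state" != "Enabled" ]; then
        echo "Host is not in maintenance mode (state: $state); not rebooting." >&2
        return 1
    fi
    # esxcli requires a reason string for the reboot.
    esxcli system shutdown reboot --reason "NVIDIA AI Enterprise driver installation"
}

# Example usage on the ESXi host:
#   reboot_if_in_maintenance
```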

Step 6: Verify Installation

Upon reboot, ensure that the NVIDIA vGPU software package is correctly installed and loaded. Check for the NVIDIA kernel driver by running the following command:

[root@esx03:~] vmkload_mod -l | grep nvidia

nvidia                   58420

You should see the NVIDIA kernel driver listed among the loaded modules.

Additionally, confirm successful communication between the NVIDIA kernel driver and the physical GPUs by running:

nvidia-smi

[root@esx03:~] nvidia-smi
Tue Mar 26 14:47:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.16              Driver Version: 550.54.16      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   35C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0             76W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   34C    P0             75W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   34C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   31C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   33C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   31C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The output should list every physical GPU in the host (eight H100 GPUs in this example), confirming that the driver can communicate with them.
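
Both checks in this step can be combined into one pass/fail function. This is a minimal sketch: verify_nvidia_host is a hypothetical helper, and it assumes the host build of nvidia-smi accepts the -L option (one "GPU n: ..." line per device), as the Linux build does.

```shell
# Hypothetical helper: check that the nvidia module is loaded and that
# the expected number of GPUs ($1) is visible to the driver.
verify_nvidia_host() {
    expected="$1"
    if ! vmkload_mod -l | grep -q '^nvidia'; then
        echo "nvidia module is not loaded" >&2
        return 1
    fi
    # nvidia-smi -L prints one line per GPU, each starting with "GPU ".
    found=$(nvidia-smi -L | grep -c '^GPU ')
    if [ "$found" -ne "$expected" ]; then
        echo "expected $expected GPUs, found $found" >&2
        return 1
    fi
    echo "OK: nvidia module loaded, $found GPUs visible"
}

# Example usage on the ESXi host (eight GPUs in this guide's server):
#   verify_nvidia_host 8
# Once the check passes, the host can leave maintenance mode:
#   esxcli system maintenanceMode set --enable false
```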

Step 7: Repeat for Cluster Hosts

Repeat Steps 3 through 6 on every host in your vSphere cluster to ensure consistent GPU support across the environment. Because the driver bundle sits on a shared datastore, it does not need to be uploaded again.
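
For clusters with many hosts, the install step can be driven over SSH from an admin workstation. The sketch below is only an illustration: the host names in HOSTS are hypothetical, BUNDLE keeps the placeholder path from Step 4, and install_on_hosts is a hypothetical helper. It assumes SSH is enabled on every host and that each host's virtual machines have been evacuated before it runs. Rebooting and verifying each host is left as a separate, deliberate step.

```shell
# Hypothetical host list; replace with the hosts in your cluster.
HOSTS="esx01 esx02 esx03"
# Placeholder path from Step 4; replace with the real bundle location.
BUNDLE="/vmfs/volumes/<datastore>/path/to/driver_bundle.zip"

# Enter maintenance mode and install the bundle on each host in turn.
install_on_hosts() {
    for host in $HOSTS; do
        echo "== $host =="
        ssh "root@$host" \
            "esxcli system maintenanceMode set --enable true && esxcli software vib install -d '$BUNDLE'" \
            || echo "FAILED on $host" >&2
    done
}

# Example usage from the workstation:
#   install_on_hosts
```

Processing hosts one at a time keeps the rest of the cluster available for workloads while each host is in maintenance mode.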