Uninstalling NVIDIA Enterprise AI Drivers from ESXi

This blog post guides you through uninstalling NVIDIA Enterprise AI drivers from an ESXi 8.0U2 host.

Putting the ESXi Host into Maintenance Mode

Before modifying software configurations, it’s crucial to put the ESXi host into maintenance mode. This ensures no running virtual machines are affected during the process.
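
You can do this from the vSphere Client, or directly from an SSH session on the host. A minimal sketch using esxcli (assuming SSH is enabled and any running VMs have already been migrated or powered off; exact output may vary):

[root@esx1-sree-lab:~] esxcli system maintenanceMode set --enable true
[root@esx1-sree-lab:~] esxcli system maintenanceMode get
Enabled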

Checking Installed NVIDIA Drivers

Once in maintenance mode, SSH to the host and use the following command to identify currently installed NVIDIA drivers:

[root@esx1-sree-lab:~] esxcli software vib list | grep -i NVD
NVD-AIE_ESXi_8.0.0_Driver      535.154.02-1OEM.800.1.0.20613240      NVD     VMwareAccepted    2024-03-20    host
nvdgpumgmtdaemon               535.154.02-1OEM.700.1.0.15843807      NVD     VMwareAccepted    2024-03-20    host

The output will display details like driver name, version, and installation date. In the example, the following NVIDIA VIBs are found:

  • NVD-AIE_ESXi_8.0.0_Driver
  • nvdgpumgmtdaemon
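
If you want fuller metadata for a single VIB before removing it (acceptance level, summary, description, and so on), you can query it directly; the exact fields shown will depend on the driver version:

[root@esx1-sree-lab:~] esxcli software vib get -n NVD-AIE_ESXi_8.0.0_Driver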

Removing the Driver VIBs

Now remove the listed VIBs using the esxcli software vib remove command. Here’s how to remove each one:

  • nvdgpumgmtdaemon:
[root@esx1-sree-lab:~] esxcli software vib remove -n nvdgpumgmtdaemon
Removal Result
   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   VIBs Installed:
   VIBs Removed: NVD_bootbank_nvdgpumgmtdaemon_535.154.02-1OEM.700.1.0.15843807
   VIBs Skipped:
   Reboot Required: true
   DPU Results:

This command removes the nvdgpumgmtdaemon VIB. The output confirms a successful removal and indicates that a reboot is required for the change to take effect.

  • NVD-AIE_ESXi_8.0.0_Driver:
[root@esx1-sree-lab:~] esxcli software vib remove -n NVD-AIE_ESXi_8.0.0_Driver
Removal Result
   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   VIBs Installed:
   VIBs Removed: NVD_bootbank_NVD-AIE_ESXi_8.0.0_Driver_535.154.02-1OEM.800.1.0.20613240
   VIBs Skipped:
   Reboot Required: true
   DPU Results:

Similarly, this command removes the main NVIDIA driver VIB and prompts for a reboot.
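
Tip: if you ever want to preview a removal before committing to it, esxcli supports a dry run that reports the VIB-level operations without changing the system, for example:

[root@esx1-sree-lab:~] esxcli software vib remove --dry-run -n NVD-AIE_ESXi_8.0.0_Driver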

Rebooting the ESXi Host

After removing both VIBs, it’s essential to reboot the ESXi host to apply the changes. Use the following command:

[root@esx1-sree-lab:~] reboot

The host will reboot, and the NVIDIA drivers will be uninstalled.
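
Alternatively, because the host is already in maintenance mode, you can reboot via esxcli and record a reason for the restart:

[root@esx1-sree-lab:~] esxcli system shutdown reboot --reason "Removing NVIDIA AI Enterprise driver VIBs"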

Verifying Uninstallation

Once the ESXi host restarts, confirm that the NVIDIA drivers are no longer present. Use the same command as before to check the installed VIBs:

[root@esx1-sree-lab:~] esxcli software vib list | grep -i NVD

If the output is empty or doesn’t contain any NVIDIA-related entries, the drivers have been successfully uninstalled.
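
You can also confirm that the NVIDIA kernel module is no longer loaded; after a successful removal, the command below should return no output:

[root@esx1-sree-lab:~] vmkload_mod -l | grep nvidia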

Important Notes:

  • This guide serves as a general overview. Always refer to the official documentation for your specific NVIDIA driver version and ESXi host configuration for detailed instructions.
  • Putting the ESXi host into maintenance mode is crucial to avoid disruptions to running virtual machines.

By following these steps, you can effectively uninstall NVIDIA Enterprise AI drivers from your ESXi 8.0U2 host.

How to Install NVIDIA Enterprise AI Driver on VMware ESXi

In modern data centers, GPU acceleration has become crucial to computational capability, especially for AI and machine learning workloads. VMware ESXi, a leading hypervisor, coupled with NVIDIA’s AI Enterprise software stack, lets enterprises efficiently deploy and manage GPU-accelerated virtualized environments. In this guide, we’ll walk through installing the NVIDIA Enterprise AI driver on VMware ESXi, enabling support for the NVIDIA H100 SXM 80GB HBM3 GPU.

Prerequisites:

  • PowerEdge XE9680 server
  • VMware ESXi 8.0 Update 2 installed
  • Custom ESXi image profile: DEL-ESXi_802.22380479-A04
  • NVIDIA H100 SXM5 80GB HBM3 GPU

Great news! NVIDIA AI Enterprise 5.0 now supports the NVIDIA H100 SXM 80GB HBM3, which was previously limited to bare-metal deployments (see the What’s New section at https://docs.nvidia.com/ai-enterprise/latest/release-notes/index.html#whats-new).

Step 1: Download the Driver Bundle

First, obtain the NVIDIA AI Enterprise 5.0 driver bundle from the NVIDIA Licensing Portal. Make sure the bundle matches your ESXi version and GPU model. In this case, we are using NVIDIA-AI-Enterprise-vSphere-8.0-550.54.16-550.54.15-551.78.zip.

Step 2: Upload the Driver Bundle

Unzip the downloaded driver bundle and upload it to a shared datastore within your vSphere cluster. Use the vSphere Client’s datastore file browser to upload it.
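
Once the upload completes, you can optionally confirm from the host shell that the bundle is reachable (the datastore name and path below are placeholders for your environment):

[root@esx03:~] ls -lh /vmfs/volumes/<datastore>/path/to/driver_bundle.zip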

Step 3: Prepare the Host

  1. Put the ESXi host into maintenance mode to ensure an uninterrupted installation.
  2. SSH into the ESXi host for command-line access (a quick GPU pre-check is sketched below).
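
Before installing, it can help to confirm that the host actually sees the GPUs. A minimal sketch (the device listing will differ per server):

[root@esx03:~] lspci | grep -i nvidia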

Step 4: Install the NVIDIA Enterprise AI Driver

Execute the following command to install the NVIDIA Enterprise AI driver on the ESXi host:

esxcli software vib install -d /vmfs/volumes/<datastore>/path/to/driver_bundle.zip

Replace <datastore> and path/to/driver_bundle.zip with the appropriate datastore name and path. After installation, you should see a confirmation message indicating success, for example:

[root@esx03:~]  esxcli software vib install -d /vmfs/volumes/nfs-ds-1/nvidia-550/NVD-AIE-800_550.54.16-1OEM.800.1.0.20613240_23471877.zip
Installation Result
   Message: Operation finished successfully.
   VIBs Installed: NVD_bootbank_NVD-AIE_ESXi_8.0.0_Driver_550.54.16-1OEM.800.1.0.20613240
   VIBs Removed:
   VIBs Skipped:
   Reboot Required: false
   DPU Results:

Step 5: Reboot the Server

Even though the installation output above reports Reboot Required: false, reboot the ESXi host to finalize the driver installation and ensure the new module is loaded.
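
For example, from the same SSH session:

[root@esx03:~] reboot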

Step 6: Verify Installation

Upon reboot, ensure that the NVIDIA vGPU software package is correctly installed and loaded. Check for the NVIDIA kernel driver by running the following command:

[root@esx03:~] vmkload_mod -l | grep nvidia
nvidia                   58420

You should see the NVIDIA kernel driver listed among the loaded modules.

Additionally, confirm successful communication between the NVIDIA kernel driver and the physical GPUs by running:

[root@esx03:~] nvidia-smi
Tue Mar 26 14:47:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.16              Driver Version: 550.54.16      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:19:00.0 Off |                    0 |
| N/A   35C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:3B:00.0 Off |                    0 |
| N/A   31C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0             76W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:5D:00.0 Off |                    0 |
| N/A   34C    P0             75W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:9B:00.0 Off |                    0 |
| N/A   34C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:BB:00.0 Off |                    0 |
| N/A   31C    P0             77W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:CB:00.0 Off |                    0 |
| N/A   33C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:DB:00.0 Off |                    0 |
| N/A   31C    P0             78W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

This command should display information about the installed NVIDIA GPUs, confirming proper functionality.
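
Once the driver is verified, take the host out of maintenance mode before moving on to the next one:

[root@esx03:~] esxcli system maintenanceMode set --enable false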

Step 7: Repeat for Cluster Hosts

Repeat the aforementioned steps for all hosts within your vSphere cluster to ensure consistent GPU support across the environment.