VMware Cloud Director Service Crashes During CSE Communication with VCD

Introduction

In the sphere of cloud management, ensuring uninterrupted service is of paramount importance. However, challenges can emerge that disrupt the smooth operation of services. Recently, a noteworthy issue surfaced for a customer: the VMware Cloud Director service crashing when the Container Service Extension communicates with VCD. This article delves into the symptoms, the cause, and, most importantly, the solution to this challenge.

It’s important to note that the workaround provided here is not an official recommendation from VMware. It should be applied at your discretion. We anticipate that VMware will release an official KB addressing this issue in the near future. The product versions under discussion in this article are VCD 10.4.2 and CSE 4.1.0.

Symptoms:

  • The VCD service crashes on VCD cells when traffic from CSE servers is permitted.
  • The count of ‘BEHAVIOR_INVOCATION’ operations in the VCD database is very high (more than 10,000):
vcloud=# select count(*) from jobs where operation = 'BEHAVIOR_INVOCATION';
 count
--------
 385151
(1 row)
  • In the logs, you may find events like the following in cell.log:
Successfully verified transfer spooling area: VfsFile[fileObject=file:///opt/vmware/vcloud-director/data/transfer] 
Cell startup completed in 1m 39s
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /opt/vmware/vcloud-director/logs/java_pid14129.hprof ...
Dump file is incomplete: No space left on device
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
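
To confirm that heap dumps are what filled the disk, a quick usage check on each cell helps (standard Linux commands, not from the original article):

df -h /opt/vmware/vcloud-director/logs
du -sh /opt/vmware/vcloud-director/logs/*.hprof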

Cause:

The root cause of this issue lies in VCD generating memory heap dumps due to an ‘OutOfMemoryError’. This, in turn, leads to the storage space being exhausted and ultimately results in the VCD service crashing.

Resolution:

The good news is that VMware has identified this as a bug within VCD and plans to address it in upcoming update releases of VCD. While we await this update, the team has suggested a workaround in case you encounter this issue:

  1. SSH into each VCD cell.
  2. Check the “/opt/vmware/vcloud-director/logs” directory for Java heap dump files (.hprof) on each cell.

cd /opt/vmware/vcloud-director/logs

  3. Remove the files with the “.hprof” extension.
[ /opt/vmware/vcloud-director/logs ]# rm java_xxxxx.hprof
  4. Connect to the VCD database:
   sudo -i -u postgres psql vcloud
  5. Delete records of the operation ‘BEHAVIOR_INVOCATION’ from the ‘jobs’ table:
   vcloud=# delete from jobs where operation = 'BEHAVIOR_INVOCATION';
  6. Perform a service restart on all the VCD cells serially:
   service vmware-vcd restart

By following these steps, you can mitigate the issue and keep your VCD service running smoothly until the official bug fix is released in VCD.
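
For convenience, here is a minimal sketch that consolidates the cleanup steps above, assuming the default log path and a local PostgreSQL database; review and adapt it before running, at your own discretion:

#!/usr/bin/env bash
# Remove the Java heap dump files that exhausted the disk (run on each cell)
rm -f /opt/vmware/vcloud-director/logs/*.hprof
# Purge BEHAVIOR_INVOCATION records from the jobs table (run once, against the VCD database)
sudo -i -u postgres psql vcloud -c "delete from jobs where operation = 'BEHAVIOR_INVOCATION';"
# Restart the VCD service (repeat serially on each cell)
service vmware-vcd restart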

Unable to connect to VMware Cloud Director 10.4 through PowerCLI

While attempting to connect to VMware Cloud Director 10.4 using PowerCLI, I encountered the error message “The server returned the following: NotAcceptable: ''.”

 PS C:\Users\sreejesh> Connect-CIServer -Server vcd.sreejesh.lab -Credential (Get-Credential)
Connect-CIServer : 3/27/2023 8:01:29 AM	Connect-CIServer		Unable to connect to vCloud Server 'https://vcd.sreejesh.lab:443/api/'. The server returned the following: NotAcceptable: ''.	
At line:1 char:1
+ Connect-CIServer -Server https://vcd.sreejesh.lab  ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Connect-CIServer], CIException
    + FullyQualifiedErrorId : Cloud_ConnectivityServiceImpl_ConnectCloudServer_ConnectError,VMware.VimAutomation.Cloud.Commands.Cmdlets.ConnectCIServer 

This is a known limitation of PowerCLI versions prior to 13.0: they do not support VMware Cloud Director API versions greater than 33.0. To resolve the issue, confirm that the installed PowerCLI version is lower than 13.0 and, if so, uninstall it and install the latest version (13.0 or later).

 
PS C:\Users\sreejesh> Get-PowerCLIVersion
WARNING: The cmdlet "Get-PowerCLIVersion" is deprecated. Please use the 'Get-Module' cmdlet instead.

PowerCLI Version
----------------
   VMware.PowerCLI 12.7.0 build 20091289
---------------
Component Versions
---------------
   VMware Common PowerCLI Component 12.7 build 20067789
   VMware Cis Core PowerCLI Component PowerCLI Component 12.6 build 19601368
   VMware VimAutomation VICore Commands PowerCLI Component PowerCLI Component 12.7 build 20091293
   VMware VimAutomation Storage PowerCLI Component PowerCLI Component 12.7 build 20091292
   VMware VimAutomation Vds Commands PowerCLI Component PowerCLI Component 12.7 build 20091295
   VMware VimAutomation Cloud PowerCLI Component PowerCLI Component 12.0 build 15940183 


PS C:\Users\sreejesh> Remove-Module -Name VMware.PowerCLI -Force
PS C:\Users\sreejesh> Install-Module -Name VMware.PowerCLI 

PS C:\Users\sreejesh> Get-PowerCLIVersion

PowerCLI Version
----------------
   VMware.VimAutomation.Core 13.0.0 build 20797821
---------------
Component Versions
---------------
   VMware Common PowerCLI Component 13.0 build 20797081
   VMware Cis Core PowerCLI Component PowerCLI Component 13.0 build 20797636
   VMware VimAutomation VICore Commands PowerCLI Component PowerCLI Component 13.0 build 20797821

How to validate a TKGm Cluster?

Please find the steps to validate a TKGm cluster deployed through VMware Container Service Extension.

Step 1: Download the kubeconfig file

  • Download the kubeconfig file to a Windows machine that has network access to the Kubernetes cluster.
  • Create a folder named .kube under $HOME.

$HOME\.kube

  • Copy the downloaded config file to the .kube folder.
  • Rename the file to ‘config’, without any extension.

Step 2: Download kubectl

  • Download kubectl for Windows from
https://dl.k8s.io/release/v1.22.0/bin/windows/amd64/kubectl.exe
  • Create a folder $HOME\kubectl and copy kubectl.exe into it. Add the folder to the ‘Path’ user variable in Environment Variables.

Run kubectl to confirm the setup, as shown below.
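
A quick way to confirm that kubectl is on the path and the kubeconfig works (standard kubectl commands, not specific to CSE):

kubectl version --client
kubectl get nodes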

Step 3: Run a ‘hello world’ application in the cluster.

Follow the steps from the following article to deploy a Hello World application in the K8s cluster you created.

Exposing an External IP Address to Access an Application in a Cluster | Kubernetes

Note: In the following command, use NodePort instead of LoadBalancer (see the adjusted version below).

kubectl expose deployment hello-world --type=LoadBalancer --name=my-service
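
With that substitution, the command becomes the following; the second command shows the node port assigned to the service (a direct adaptation of the article's command):

kubectl expose deployment hello-world --type=NodePort --name=my-service
kubectl get service my-service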

How to run VMware Container Service Extension (CSE) as a Linux Service?

After installing CSE, please follow the steps below to run it as a service.

Create cse.sh file

Create the cse.sh file. You can copy the following code or create a new one based on the following link:
container-service-extension/cse.sh at master · vmware/container-service-extension (github.com)

# vi /opt/vmware/cse/cse.sh
#!/usr/bin/env bash
export CSE_CONFIG=/opt/vmware/cse/encrypted-config.yaml
export CSE_CONFIG_PASSWORD=<passwd>
cse run

Copy encrypted-config.yaml to the /opt/vmware/cse directory.

Change the file permission to make the script executable:

chmod +x /opt/vmware/cse/cse.sh
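
Since cse.sh stores the config password in plain text, it is also worth restricting read access to the owner (a general hardening step, not part of the original instructions):

chmod 700 /opt/vmware/cse/cse.sh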

Create cse.service file

Create the cse.service file. You can copy the following code or create a new one based on the following link:
container-service-extension/cse.service at master · vmware/container-service-extension (github.com)

vi /etc/systemd/system/cse.service
[Unit]
Description=Container Service Extension for VMware Cloud Director

[Service]
ExecStart=/opt/vmware/cse/cse.sh
User=root
WorkingDirectory=/opt/vmware/cse
Type=simple
Restart=always

[Install]
WantedBy=default.target
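
After creating the new unit file, systemd typically needs to reload its configuration before the unit can be managed (a standard systemd step, not specific to CSE):

# systemctl daemon-reload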

Enable and start the service

# systemctl enable cse.service
# systemctl start cse.service

Check the service status

# systemctl status cse.service
  cse.service - Container Service Extension for VMware Cloud Director
   Loaded: loaded (/etc/systemd/system/cse.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2021-11-24 14:43:56 +01; 1min 9s ago
 Main PID: 770 (bash)
   CGroup: /system.slice/cse.service
           ├─770 bash /opt/vmware/cse/cse.sh
           └─775 /usr/local/bin/python3.7 /usr/local/bin/cse run

Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Validating CSE installation according to config file
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: MQTT extension and API filters found
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Found catalog 'cse-site1-k8s'
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: CSE installation is valid
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Started thread 'MessageConsumer' (140229531580160)
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Started thread 'ConsumerWatchdog' (140229523187456)
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Container Service Extension for vCloud Director
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Server running using config file: /opt/vmware/cse/encrypted-config.yaml
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Log files: /root/.cse-logs/cse-server-info.log, /root/.cse-logs/cse-server-debug.log
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: waiting for requests (ctrl+c to close)

Upgrade vRealize Operations Tenant App to 8.6 Using the Offline Upgrade Image

In this blog post, I am going to review the process of updating a vRealize Operations Tenant App for VMware Cloud Director 2.6 virtual appliance to version 8.6.

  1. First, download the vROps Tenant App 8.6 offline upgrade ISO from the VMware Marketplace. The upgrade bundle is available at the following URL, under Resource & Support > ‘Offline Upgrade: vRealize Operations TA 8.6’.

    https://marketplace.cloud.vmware.com/services/details/vrealize-operations-tenant-app-for-vmware-cloud-director-2-61211?slug=true
  2. Take a snapshot of the Tenant App appliance.
  3. Connect the ISO to the Tenant App appliance. You can use ‘Client Device’, ‘Host Device’, ‘Datastore ISO File’ or ‘Content Library ISO File’ to mount the ISO.
  4. In a web browser, log in to the virtual appliance management interface (VAMI). The URL for the VAMI is https://tenantapp_appliance_address:5480
  5. Click the Update tab.
  6. Click Settings, select Use CDROM Updates, and click Save Settings.
  7. Click Status and click Check Updates. The appliance version appears in the list of available updates.

8. Click Install Updates and click OK. It will take a while for the upgrade to complete.

9. After the updates install, click the System tab and click Reboot.
10. After the reboot, confirm the Tenant App version is 8.6.

How to delete the failed TKGm or Native k8s cluster in CSE?

In CSE 3.1.1, a delete operation on a cluster (Native or TKG) that is in an error state (RDE.state = RESOLUTION_ERROR or status.phase = FAILED) may fail with a Bad Request (400), or the delete process may get stuck in the ‘DELETE:IN_PROGRESS’ state. The steps to resolve the issue are given below.

Step 1: Assign the API Explorer privilege to the CSE service account.

Log in to the VCD provider portal as an administrator.

Edit the CSE Service Role and add the API Explorer privilege.

Navigate to Administration > Provider Access Control > Roles > CSE Service Role.

In the tenant portal, check whether there are any stale vApp entries for the failed clusters. If so, delete them.

Step 2: Get the failed cluster ID through the vcd CLI

# vcd cse cluster list
Name        Org             Owner     VDC            K8s Runtime    K8s Version            Status
----------  --------------  --------  -------------  -------------  ---------------------  ------------------
test-tkg-1  CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS
tkg         CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS
tkgtest     CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS
tkg-test    CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS
tkg-test3   CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS

# vcd cse cluster info tkg-test3 | grep uid
  uid: urn:vcloud:entity:cse:nativeCluster:9364bf18-0faa-49ce-8be7-7e92af692d1b

Step 3: Run the GET call to check the status.

Log in to the VCD provider portal with the CSE service account that has the CSE Service Role assigned.
Open the API Explorer.
Click on the GET /1.0.0/entities/{id} call in the definedEntity section.

Click on ‘Try it out’.
In the id field, provide the cluster UID from the previous step.
In the output, you can see the state as PRE_CREATED.

Step 4: Run the POST ‘resolve’ call.

Select the POST call from the definedEntity section:

/1.0.0/entities/{id}/resolve
Validates the defined entity against the entity type schema.

Provide the cluster ID and run the call. The state will change to RESOLVED.


Step 5: Run the DELETE call to delete the RDE.

Provide the cluster ID and ‘false’ as the value for invokeHooks.
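
If you prefer the command line to the API Explorer, the same calls can be made with curl. This is a minimal sketch, assuming a valid provider session bearer token in $TOKEN and the VCD address in $VCD; the paths mirror the API Explorer calls above:

# Resolve the defined entity (its state should change to RESOLVED)
curl -k -X POST -H "Authorization: Bearer $TOKEN" \
  "https://$VCD/cloudapi/1.0.0/entities/urn:vcloud:entity:cse:nativeCluster:9364bf18-0faa-49ce-8be7-7e92af692d1b/resolve"

# Delete the RDE without invoking behavior hooks
curl -k -X DELETE -H "Authorization: Bearer $TOKEN" \
  "https://$VCD/cloudapi/1.0.0/entities/urn:vcloud:entity:cse:nativeCluster:9364bf18-0faa-49ce-8be7-7e92af692d1b?invokeHooks=false"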

Check and confirm that the failed cluster has now been deleted:

# vcd cse cluster list
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
Name      Org             Owner     VDC            K8s Runtime    K8s Version            Status
--------  --------------  --------  -------------  -------------  ---------------------  ------------------
tkg       CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS
tkgtest   CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS
tkg-test  CSE-Site1-Test  orgadmin  CSE-TEST-OVDC  TKGm           TKGm v1.21.2+vmware.1  DELETE:IN_PROGRESS


Upgrade VMware Cloud Director App Launchpad from 2.0 to 2.1

Please find the steps to upgrade VMware Cloud Director App Launchpad from version 2.0 to 2.1.

  1. Download VMware Cloud Director App Launchpad 2.1 RPM package from here.
  2. Upload it to the App Launchpad VM.
  3. Open an SSH connection to the App Launchpad VM and log in as root.
  4. Upgrade the RPM package.
[root@test ~]# rpm -U vmware-alp-2.1.0-18834930.x86_64.rpm
warning: vmware-alp-2.1.0-18834930.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID 001e5cc9: NOKEY
Upgrading...

Execute 'alp upgrade' to upgrade ...

  Append the excute permission to the existing logs...
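
Optionally, you can confirm the upgraded package is in place with a standard RPM query (not part of the original steps):

rpm -q vmware-alp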

5. Run the following command to upgrade App Launchpad.

[root@test ~]# alp upgrade --admin-user administrator@system --admin-pass 'passwd'
Upgraded the plugin of App Launchpad successfully.

Upgraded the management service successfully.
  [Upgrade Task]
     CREATE_ENTITY_TYPE_CATALOG_INFO : true
     MIGRATE_CATALOGS : true
     CREATE_ENTITY_TYPE_SIZING_TEMPLATE : true
     MIGRATE_LEGACY_SIZING_TEMPLATES : true
     CREATE_ENTITY_TYPE_MARKETPLACE_BANNER : true
     CREATE_ENTITY_TYPE_ORG_METRICS : true
     UPGRADE_SERVICE_ROLE : true

6. Restart the alp service and confirm it's running.

[root@test~]# systemctl restart alp
[root@test ~]# systemctl status alp
● alp.service - VMware ALP Management Service
   Loaded: loaded (/usr/lib/systemd/system/alp.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-11-18 11:46:14 +01; 14s ago
 Main PID: 29334 (java)
   CGroup: /system.slice/alp.service
           └─29334 java -jar /opt/vmware/alp/alp.jar --logging.path=log

Nov 18 11:46:14 bd1-srp-al01.acs.local systemd[1]: Stopped VMware ALP Management Service.
Nov 18 11:46:14 bd1-srp-al01.acs.local systemd[1]: Started VMware ALP Management Service.

7. Diagnose deployment errors by running the /opt/vmware/alp/bin/diagnose executable file.

The diagnose tool verifies that the services are up and running and that all configuration
requirements are met.

[root@test ~]# /opt/vmware/alp/bin/diagnose

Step 1: System diagnose
--------------------------------------------------------------------------------
- App Launchpad service is initialized.


Step 2: Cloud Director diagnose
--------------------------------------------------------------------------------
- Service Account for App Launchpad is good.
- App Launchpad's extension is ready.


Step 3: MQTT diagnose
--------------------------------------------------------------------------------
- Cloud Director MQTT for extensibility is ready.


Step 4: Integration diagnose
--------------------------------------------------------------------------------
- App Launchpad API is up, and version is 2.1.0-18834930.


Step 5: App Launchpad diagnose
--------------------------------------------------------------------------------
- App Launchpad service is listening on port 8086.


8. Confirm the ALP version.

[root@test ~]# alp
NAME:
        alp - The Cloud Director App Launchpad
        (ALP) Command-line tool

USAGE:
        alp <subcommand> [flags]

VERSION:
        '2.1.0-18834930'