In the sphere of cloud management, ensuring uninterrupted service is of paramount importance. However, challenges can emerge, affecting the smooth operation of services. Recently, a noteworthy issue surfaced with a customer – the ‘VMware Cloud Director service crashing when Container Service Extension communicates with VCD.’ This article delves into the symptoms, causes, and, most crucially, the solution to address this challenge.
It’s important to note that the workaround provided here is not an official recommendation from VMware. It should be applied at your discretion. We anticipate that VMware will release an official KB addressing this issue in the near future. The product versions under discussion in this article are VCD 10.4.2 and CSE 4.1.0.
Symptoms:
The VCD service crashes on VCD cells when traffic from CSE servers is permitted.
The count of ‘BEHAVIOR_INVOCATION’ operations in the VCD database is very high (more than 10000).
vcloud=# select count(*) from jobs where operation = 'BEHAVIOR_INVOCATION';
count
--------
385151
(1 row)
You may find the following events in cell.log:
Successfully verified transfer spooling area: VfsFile[fileObject=file:///opt/vmware/vcloud-director/data/transfer]
Cell startup completed in 1m 39s
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /opt/vmware/vcloud-director/logs/java_pid14129.hprof ...
Dump file is incomplete: No space left on device
log4j:ERROR Failed to flush writer,
java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
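The two log symptoms above can be checked with a short script. This is a minimal sketch, not a VMware tool: the function name check_cell_log is hypothetical, and the log location in the usage comment is an assumption based on the logs directory mentioned in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the two symptom messages in a cell log file.
# Prints "oom=<n> nospace=<n>" and returns 0 only if both are present.
check_cell_log() {
    local log_file="$1"
    local oom nospace
    [ -r "$log_file" ] || { echo "cannot read $log_file" >&2; return 2; }
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so '|| true' keeps the count without failing.
    oom=$(grep -c 'java.lang.OutOfMemoryError' "$log_file" || true)
    nospace=$(grep -c 'No space left on device' "$log_file" || true)
    echo "oom=${oom} nospace=${nospace}"
    [ "$oom" -gt 0 ] && [ "$nospace" -gt 0 ]
}

# Example usage on a cell (assumed log path):
#   check_cell_log /opt/vmware/vcloud-director/logs/cell.log
```

A zero exit status means both messages were found, which matches the pattern described in this article; it does not prove the same root cause.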
Cause:
The root cause is that VCD hits an ‘OutOfMemoryError’ and writes Java heap dumps to disk. The heap dumps exhaust the storage space, and the VCD service ultimately crashes.
Resolution:
The good news is that VMware has identified this as a bug within VCD and plans to address it in an upcoming VCD update release. While we await this update, the team has suggested a workaround in case you encounter this issue:
SSH into each VCD cell.
Check the “/opt/vmware/vcloud-director/logs” directory for Java heap dump files (.hprof) on each cell.
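The check for heap dump files can be sketched as follows. The function name list_heap_dumps is hypothetical; the default directory is the one named in the step above. The sketch only lists files and sizes, it does not remove anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Java heap dump files (*.hprof) in a log
# directory together with their size in bytes, one "size<TAB>path" per line.
list_heap_dumps() {
    local log_dir="${1:-/opt/vmware/vcloud-director/logs}"
    local f
    for f in "$log_dir"/*.hprof; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -f "$f" ] || continue
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Example usage on a cell: show the dumps, then the free space on the
# filesystem that holds them.
#   list_heap_dumps
#   df -h /opt/vmware/vcloud-director/logs
```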
vi /etc/systemd/system/cse.service
[Unit]
Description=Container Service Extension for VMware Cloud Director
[Service]
ExecStart=/opt/vmware/cse/cse.sh
User=root
WorkingDirectory=/opt/vmware/cse
Type=simple
Restart=always
[Install]
WantedBy=default.target
# systemctl status cse.service
cse.service - Container Service Extension for VMware Cloud Director
Loaded: loaded (/etc/systemd/system/cse.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2021-11-24 14:43:56 +01; 1min 9s ago
Main PID: 770 (bash)
CGroup: /system.slice/cse.service
├─770 bash /opt/vmware/cse/cse.sh
└─775 /usr/local/bin/python3.7 /usr/local/bin/cse run
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Validating CSE installation according to config file
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: MQTT extension and API filters found
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Found catalog 'cse-site1-k8s'
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: CSE installation is valid
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Started thread 'MessageConsumer' (140229531580160)
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Started thread 'ConsumerWatchdog' (140229523187456)
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Container Service Extension for vCloud Director
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Server running using config file: /opt/vmware/cse/encrypted-config.yaml
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: Log files: /root/.cse-logs/cse-server-info.log, /root/.cse-logs/cse-server-debug.log
Nov 24 14:44:06 cse01.lab.com cse.sh[770]: waiting for requests (ctrl+c to close)
I’ve downloaded the Ubuntu 20.04 Kubernetes v1.21.2 OVA since that’s the latest available version. File name: ubuntu-2004-kube-v1.21.2+vmware.1-tkg.1-7832907791984498322.ova
Step2: Import TKG OVA to VCD Catalog
Upload the downloaded OVA to the CSE server, then use the following command to import it into the catalog.
# cse template import -c encrypted-config.yaml -F ubuntu-2004-kube-v1.21.2+vmware.1-tkg.1-7832907791984498322.ova
Required Python version: >= 3.7.3
Installed Python version: 3.7.12 (default, Nov 23 2021, 15:49:55)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
Password for config file decryption:
Decrypting 'encrypted-config.yaml'
Validating config file 'encrypted-config.yaml'
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
Connected to vCloud Director (vcd.lab.com:443)
Connected to vCenter Server 'demovc.local' as '[email protected]' (demovc.local)
Config file 'encrypted-config.yaml' is valid
Uploading 'ubuntu-2004-kube-v1.21.2+vmware.1-tkg.1-7832907791984498322' to catalog 'cse-site1-k8s'
Uploaded 'ubuntu-2004-kube-v1.21.2+vmware.1-tkg.1-7832907791984498322' to catalog 'cse-site1-k8s'
Writing metadata onto catalog item ubuntu-2004-kube-v1.21.2+vmware.1-tkg.1-7832907791984498322.
Successfully imported TKGm OVA.
Step3: Restart CSE service.
I assume you’ve configured CSE to run as a service. If yes, restart the service.
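A minimal sketch of the restart, assuming the unit is named cse.service as in the unit file shown earlier. The function name restart_and_check and the SYSTEMCTL override variable are hypothetical conveniences, not part of CSE.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart a systemd unit and report whether it is
# active afterwards. SYSTEMCTL can be overridden (for example for a dry run).
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

restart_and_check() {
    local unit="${1:-cse.service}"
    "$SYSTEMCTL" restart "$unit" || return 1
    # 'systemctl is-active --quiet' exits 0 only when the unit is active.
    "$SYSTEMCTL" is-active --quiet "$unit"
}

# Example usage on the CSE server (run as root):
#   restart_and_check cse.service && echo "CSE is running"
```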
Step4: Confirm TKG is available as option for Kubernetes Runtime
Log in to the tenant portal and navigate to More > Kubernetes Container Clusters.
In CSE 3.1.1, a delete operation on a cluster (Native or TKG) that is in an error state (RDE.state = RESOLUTION_ERROR (or) status.phase = :FAILED) may fail with Bad request (400), or the delete process may get stuck in the ‘DELETE:IN_PROGRESS’ state. The steps below resolve the issue.
Step1: Assign API explorer privilege to the CSE Service Account.
Log in to the VCD Provider portal as Administrator.
Edit the CSE Service Role.
Navigate to Administration > Provider Access Control > Roles > CSE Service Role.
In the tenant portal, check whether there are any stale vApp entries for the failed clusters. If so, delete them.
Log in to the VCD Provider portal with the CSE service account that has the CSE Service Role assigned. Open API Explorer. Click on GET in the definedEntity section.
Click on Try it out. In Description, provide the cluster UID from the last step. In the output we can see the state as PRE_CREATED.
Step3: Run the POST resolve call to resolve the RDE
Select the POST call from definedEntity section.
/1.0.0/entities/{id}/resolve Validates the defined entity against the entity type schema.
Provide the cluster ID and run the call. The state will be changed to RESOLVED.
Step4: Run the DELETE call to delete RDE.
Provide the cluster ID and ‘false’ as the value for invokeHooks.
Please check and confirm the failed Cluster is deleted now.
#vcd cse cluster list
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
Name Org Owner VDC K8s Runtime K8s Version Status
-------- -------------- -------- ------------- ------------- --------------------- ------------------
tkg CSE-Site1-Test orgadmin CSE-TEST-OVDC TKGm TKGm v1.21.2+vmware.1 DELETE:IN_PROGRESS
tkgtest CSE-Site1-Test orgadmin CSE-TEST-OVDC TKGm TKGm v1.21.2+vmware.1 DELETE:IN_PROGRESS
tkg-test CSE-Site1-Test orgadmin CSE-TEST-OVDC TKGm TKGm v1.21.2+vmware.1 DELETE:IN_PROGRESS
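The resolve and delete calls from Step3 and Step4 can also be issued outside the API Explorer. The sketch below only builds the request URLs. It assumes the API Explorer paths are served under https://<vcd>/cloudapi, and the function names, TOKEN, API_VERSION and CLUSTER_ID are hypothetical placeholders you would supply yourself.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that build the URLs for the calls in Step3 and Step4.
# Assumption: the API Explorer paths are served under https://<vcd>/cloudapi.

rde_resolve_url() {
    # $1 = VCD host name, $2 = cluster (defined entity) ID
    printf 'https://%s/cloudapi/1.0.0/entities/%s/resolve\n' "$1" "$2"
}

rde_delete_url() {
    # invokeHooks=false matches the value entered in the API Explorer.
    printf 'https://%s/cloudapi/1.0.0/entities/%s?invokeHooks=false\n' "$1" "$2"
}

# Example usage (assumptions: TOKEN holds a valid bearer token for the CSE
# service account, and API_VERSION is a version your VCD release supports):
#   curl -sS -X POST -H "Authorization: Bearer $TOKEN" \
#        -H "Accept: application/json;version=$API_VERSION" \
#        "$(rde_resolve_url vcd.lab.com "$CLUSTER_ID")"
#   curl -sS -X DELETE -H "Authorization: Bearer $TOKEN" \
#        -H "Accept: application/json;version=$API_VERSION" \
#        "$(rde_delete_url vcd.lab.com "$CLUSTER_ID")"
```

Keeping the URL construction in small functions makes it easy to print and review the exact request before sending a DELETE.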
Please find the steps to deploy Container Service Extension 3.1.1.
Step 1: Deploy CentOS 7 VM
I selected CentOS 7 as the operating system for the CSE server because CentOS 7 reaches end of life later than CentOS 8. You can find the installation steps for CentOS 7 here.
Please find more details on CentOS releases below.
Ensure the following configurations are done on the CSE VM.
Configure DNS.
Configure NTP.
Configure SSH.
SSH access for root user is enabled.
Please note that the following network connections are required for the rest of the configuration.
Access to the VCD URL (https) from the CSE server.
Internet access from the CSE server.
Step 2: Take a snapshot of CSE server VM.
This step is optional, but it’s recommended to take a snapshot of the CSE server before continuing with the Python installation.
Step 3: Install Python 3.7.3 or greater
Install Python 3.7.3 or greater in the 3.7.x series. Please note that Python 3.8.0 and above is not supported (ref: CSE doc). The built-in Python version in CentOS 7 is 2.7, so we have to install the latest release in the 3.7.x series; at the moment that is version 3.7.12. Please follow the steps below to install Python.
yum update -y
yum install -y yum-utils
yum groupinstall -y development
yum install -y gcc openssl-devel bzip2-devel libffi-devel zlib-devel xz-devel
#Install sqlite3
cd /tmp/
curl -O https://www.sqlite.org/2020/sqlite-autoconf-3310100.tar.gz
tar xvf sqlite-autoconf-3310100.tar.gz
cd sqlite-autoconf-3310100/
./configure
make install
# Install Python
cd /tmp/
curl -O https://www.python.org/ftp/python/3.7.12/Python-3.7.12.tgz
tar -xvf Python-3.7.12.tgz
cd Python-3.7.12
./configure --enable-optimizations
make altinstall
alternatives --install /usr/bin/python3 python3 /usr/local/bin/python3.7 1
alternatives --install /usr/bin/pip3 pip3 /usr/local/bin/pip3.7 1
alternatives --list
# Check Python and pip3 versions
python3 --version
pip3 --version
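The version requirement described in this step (3.7.3 or greater, but still in the 3.7.x series) can be verified with a small check. This is a sketch only; the function name is_supported_python is hypothetical and the range is taken from the text of this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 when a version string is 3.7.x with x >= 3
# (the range this article describes as supported), non-zero otherwise.
is_supported_python() {
    local version="$1"
    local major minor patch part
    IFS=. read -r major minor patch <<EOF
$version
EOF
    # Every component must be a plain non-negative integer.
    for part in "$major" "$minor" "$patch"; do
        case "$part" in
            ''|*[!0-9]*) return 2 ;;
        esac
    done
    [ "$major" -eq 3 ] && [ "$minor" -eq 7 ] && [ "$patch" -ge 3 ]
}

# Example usage: 'python3 --version' prints e.g. "Python 3.7.12".
#   ver=$(python3 --version 2>&1 | awk '{print $2}')
#   is_supported_python "$ver" && echo "Python $ver is in the supported range"
```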
Step 4: Install vcd-cli
# Install and verify vcd-cli
pip3 install vcd-cli
vcd version
vcd-cli, VMware vCloud Director Command Line Interface, 24.0.1
Step 5: Install CSE
# Install and verify cse
pip3 install container-service-extension
cse version
CSE, Container Service Extension for VMware vCloud Director, version 3.1.1
Step 6: Create CSE Service Role for CSE server management
[root@test ~]# cse create-service-role <vcd fqdn> -s
Username for System Administrator: administrator
Password for administrator:
Connecting to vCD: <vcd fqdn>
Connected to vCD as system administrator: administrator
Creating CSE Service Role...
Successfully created CSE Service Role
Step 7: Create service account for CSE in VCD
Create a Service Account in VCD with the role ‘CSE Service Role’
Step 8: Create service account for CSE in vCenter
Create a new role in vCenter with Power User and Guest Operations privileges. Assign the role to the service account for CSE.
Clone the ‘Virtual Machine Power User (sample)’ role.
Edit role
Select Virtual machine > Guest operations.
Step 9: Create a sample CSE config file and update it.
The template download takes a while to complete, so be patient.
Downloading file from 'https://cloud-images.ubuntu.com/releases/xenial/release-20180418/ubuntu-16.04-server-cloudimg-amd64.ova' to 'cse_cache/ubuntu-16.04-server-cloudimg-amd64.ova'...
Download complete
Uploading 'ubuntu-16.04-server-cloudimg-amd64.ova' to catalog 'cse-site1-k8s'
Uploaded 'ubuntu-16.04-server-cloudimg-amd64.ova' to catalog 'cse-site1-k8s'
Deleting temporary vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Creating vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Found data file: /root/.cse_scripts/2.0.0/ubuntu-16.04_k8-1.21_weave-2.8.1_rev1/init.sh
Created vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Customizing vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp', vm 'ubuntu-1604-k8s1212-weave281-vm'
Found data file: /root/.cse_scripts/2.0.0/ubuntu-16.04_k8-1.21_weave-2.8.1_rev1/cust.sh
Waiting for guest tools, status: "vm='vim.VirtualMachine:vm-2296', status=guestToolsNotRunning
Waiting for guest tools, status: "vm='vim.VirtualMachine:vm-2296', status=guestToolsNotRunning
Waiting for guest tools, status: "vm='vim.VirtualMachine:vm-2296', status=guestToolsNotRunning
Waiting for guest tools, status: "vm='vim.VirtualMachine:vm-2296', status=guestToolsRunning
.....
......
......
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (1)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (2)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (3)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (4)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (5)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (6)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (7)
waiting for process 1611 on vm 'vim.VirtualMachine:vm-2296' to finish (8)
...
...
..
/etc/kernel/postinst.d/x-grub-legacy-ec2:
Searching for GRUB installation directory ... found: /boot/grub
Searching for default file ... found: /boot/grub/default
Testing for an existing GRUB menu.lst file ... found: /boot/grub/menu.lst
Searching for splash image ... none found, skipping ...
Found kernel: /boot/vmlinuz-4.4.0-119-generic
Found kernel: /boot/vmlinuz-4.4.0-210-generic
Found kernel: /boot/vmlinuz-4.4.0-119-generic
Updating /boot/grub/menu.lst ... done
/etc/kernel/postinst.d/zz-update-grub:
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.4.0-210-generic
Found initrd image: /boot/initrd.img-4.4.0-210-generic
Found linux image: /boot/vmlinuz-4.4.0-119-generic
Found initrd image: /boot/initrd.img-4.4.0-119-generic
done
customization completed
Customized vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp', vm 'ubuntu-1604-k8s1212-weave281-vm'
Creating K8 template 'ubuntu-16.04_k8-1.21_weave-2.8.1_rev1' from vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Shutting down vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Successfully shut down vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Capturing template 'ubuntu-16.04_k8-1.21_weave-2.8.1_rev1' from vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Created K8 template 'ubuntu-16.04_k8-1.21_weave-2.8.1_rev1' from vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Successfully tagged template ubuntu-16.04_k8-1.21_weave-2.8.1_rev1 with placement policy native.
Deleting temporary vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Deleted temporary vApp 'ubuntu-16.04_k8-1.21_weave-2.8.1_temp'
Step 12: Confirm the template is available in CSE catalog
Log in to the CSE tenant portal and navigate to Libraries > Catalogs > vApp Templates. We can see the newly created K8S upstream template.
Step 13: Enable Organizations for Native deployments.
The provider must explicitly enable organizational virtual datacenter(s) to host native deployments, by running the command: vcd cse ovdc enable.
vcd login <vcd> system administrator -i
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
Password:
administrator logged in, org: 'system', vdc: ''
# vcd cse ovdc enable <orgvdc> -n -o <organization>
# vcd cse ovdc enable TEST-OVDC -n -o Site1-Test
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
OVDC Update: Updating OVDC placement policies
task: 10e70b37-5aa6-4cf9-b437-ef478bd9f06a, Operation success, result: success
Step 14: Check Create New Native Cluster is available now
Log in to the VCD tenant portal and navigate to More > Kubernetes Container Clusters. Click on New.
We can see the option to ‘Create New Native Cluster’.
Step 15: Publish Right Bundle ‘cse:nativeCluster Entitlement’
The following article has details on differences between right bundle and roles.
Please find the steps to upgrade VMware Cloud Director App Launchpad from version 2.0 to 2.1.
1. Download the VMware Cloud Director App Launchpad 2.1 RPM package from here.
2. Upload it to the App Launchpad VM.
3. Open an SSH connection to the App Launchpad VM and log in as root.
4. Upgrade the RPM package.
[root@test ~]# rpm -U vmware-alp-2.1.0-18834930.x86_64.rpm
warning: vmware-alp-2.1.0-18834930.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID 001e5cc9: NOKEY
Upgrading...
Execute 'alp upgrade' to upgrade ...
Append the excute permission to the existing logs...
5. Run the following command to upgrade App Launchpad.
[root@test ~]# alp upgrade --admin-user administrator@system --admin-pass 'passwd'
Upgraded the plugin of App Launchpad successfully.
Upgraded the management service successfully.
[Upgrade Task]
CREATE_ENTITY_TYPE_CATALOG_INFO : true
MIGRATE_CATALOGS : true
CREATE_ENTITY_TYPE_SIZING_TEMPLATE : true
MIGRATE_LEGACY_SIZING_TEMPLATES : true
CREATE_ENTITY_TYPE_MARKETPLACE_BANNER : true
CREATE_ENTITY_TYPE_ORG_METRICS : true
UPGRADE_SERVICE_ROLE : true
6. Restart the alp service and confirm it’s running.
[root@test ~]# systemctl restart alp
[root@test ~]# systemctl status alp
● alp.service - VMware ALP Management Service
Loaded: loaded (/usr/lib/systemd/system/alp.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2021-11-18 11:46:14 +01; 14s ago
Main PID: 29334 (java)
CGroup: /system.slice/alp.service
└─29334 java -jar /opt/vmware/alp/alp.jar --logging.path=log
Nov 18 11:46:14 bd1-srp-al01.acs.local systemd[1]: Stopped VMware ALP Management Service.
Nov 18 11:46:14 bd1-srp-al01.acs.local systemd[1]: Started VMware ALP Management Service.
7. Diagnose deployment errors by running the /opt/vmware/alp/bin/diagnose executable file.
The diagnose tool verifies that the services are up and running and that all configuration requirements are met.
[root@test ~]# /opt/vmware/alp/bin/diagnose
Step 1: System diagnose
--------------------------------------------------------------------------------
- App Launchpad service is initialized.
Step 2: Cloud Director diagnose
--------------------------------------------------------------------------------
- Service Account for App Launchpad is good.
- App Launchpad's extension is ready.
Step 3: MQTT diagnose
--------------------------------------------------------------------------------
- Cloud Director MQTT for extensibility is ready.
Step 4: Integration diagnose
--------------------------------------------------------------------------------
- App Launchpad API is up, and version is 2.1.0-18834930.
Step 5: App Launchpad diagnose
--------------------------------------------------------------------------------
- App Launchpad service is listening on port 8086.
8. Confirm the ALP version.
[root@test ~]# alp
NAME:
alp - The Cloud Director App Launchpad
(ALP) Command-line tool
USAGE:
alp <subcommand> [flags]
VERSION:
'2.1.0-18834930'