Monday 25 June 2018

Repairing a corrupt RPM database on VCSA

The other day I decided to upgrade my 6.5 U1 VCSA to U2. After successfully completing the upgrade in my development lab I moved on to one of the production sites. This worked as expected, so I did not expect any issues with the remaining site. However, this was not the case: after clicking the upgrade button I got a message stating that the "VAMI Upgrade Staging Failed".

Working with VMware support, we came to the conclusion that the VCSA's RPM database had somehow become corrupted.



To solve the problem I took the following steps:

  • Snapshot the appliance
  • Take a backup of all the __db files in /var/lib/rpm
  • rm /var/lib/rpm/__db* to remove the db files
  • rpm -qa, which will ideally recreate the database

After executing the above, run rpm -qa | grep -i rpm to confirm the database is queryable again.
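
Put together, the sequence looks roughly like this from the VCSA bash shell. The /root/rpmdb-backup location is just an example, and rpm --rebuilddb is an optional extra step to force the index rebuild:

    mkdir /root/rpmdb-backup
    cp -a /var/lib/rpm/__db* /root/rpmdb-backup/   # keep a copy of the Berkeley DB files first
    rm /var/lib/rpm/__db*                          # remove the corrupt db files
    rpm --rebuilddb                                # optionally force a rebuild of the indexes
    rpm -qa | grep -i rpm                          # a clean listing means the database is healthy again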




This allowed me to run the upgrade process successfully from the VAMI interface.

Thursday 24 May 2018

Resuming a failed one-click hypervisor upgrade


I am a huge fan of the one-click functionality in PRISM, and things are getting even better as Lifecycle Manager progresses. One of the options you have with one-click is to upgrade your hypervisor, and I tend to use this instead of VMware's Update Manager. Just a few days ago I was updating another cluster and, to my surprise, it failed after upgrading the first node successfully.





After a bit of investigation it turned out that a UVM could not migrate, and this was all due to a misconfigured vMotion interface. So how do you resume the process after fixing said vMotion issue? Pretty simple actually. Log into a CVM and issue the allssh 'genesis restart' command.
Log out of PRISM and back in again, and the process will happily continue.
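
For reference, this is what it looks like from any CVM. Checking host_upgrade_status afterwards is a handy way to confirm the upgrade picked up where it left off, if that command is available on your AOS version:

    nutanix@cvm$ allssh 'genesis restart'   # restart genesis on every CVM in the cluster
    nutanix@cvm$ host_upgrade_status        # confirm the hypervisor upgrade has resumed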


Tuesday 13 March 2018

Removing nodes from a Nutanix cluster and adding them to another one


Like most of us who use Nutanix, we started off small. Today we have 60+ nodes across several clusters. One of these clusters houses both production and development workloads. This has served us well, but as the environment continues to grow I want to split it into a dedicated cluster for production and another one for development. It will also give me the opportunity to easily re-IP the cluster, something the security team has tasked me with. Furthermore, it will allow me to settle on a certain node model and minimize having a mixture of different nodes.

The current cluster consists of 19 nodes: 14 are used for development workloads and 5 for production workloads. The plan is to end up with 4 nodes as a dedicated production cluster. Another 4 are getting on a bit but have plenty of space, and will become a dedicated backup cluster at our remote site. The 11 remaining nodes will be added to an existing, dedicated development cluster.

As with everything Nutanix there should be no disruptions, but it still made me somewhat nervous as this particular production workload is the most important system we have. You can only remove one node at a time, so it can be time-consuming.

  1. You cannot place the host in maintenance mode, as the CVM needs to remain powered on, so manually migrate the VMs to the other nodes in the cluster.
  2. There is a chance that DRS will place VMs back on the node you are trying to evacuate, so you may want to set DRS to manual or partially automated.
  3. Run an NCC check to make sure all is in order, as shown below.
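
The NCC check can be kicked off from any CVM; a full run looks like this:

    nutanix@cvm$ ncc health_checks run_all   # runs the complete NCC health check suite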
Next you want to unmount the datastore from the node, and you will find that it will not unmount. The reason for this is that the datastore is being used as a heartbeat datastore, so just disable HA on the cluster temporarily. There are several ways to unmount the datastores; I prefer to do it via PRISM.

  • Go to Storage > Table and select the datastore. Click Update.
  • Select Mount/Unmount on the following ESX hosts and deselect the host in question.



  • Click OK when prompted.
Now that the datastores are unmounted, you can proceed with removing the nodes from the cluster.

  • Go to Hardware > Table and select your host. Click Remove Host.
  • Click OK when prompted



  • The removal process will start and you can follow its progress under Tasks; you can also watch it from the CLI, as shown after this list. This is a time-consuming process.



  • Once the process is complete you should see a decrease in available storage capacity, and the host will no longer be visible.
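
If you prefer watching the removal from a CVM rather than the PRISM task view, something like this works on the AOS versions I have used (the output format varies between versions):

    nutanix@cvm$ ncli host list                    # note the Id of the host being removed
    nutanix@cvm$ progress_monitor_cli --fetchall   # lists progress entities, including the node removal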
Before we add the node to the new cluster, we want to remove it from the virtual distributed switch (assuming you used one) and remove the host from vCenter.

Expanding the cluster is fairly easy through PRISM, but I ran into a few issues, so to speak. This was due to the fact that I had to re-IP the node while adding it to the cluster.

  • Before starting the process, set the ESXi root password to the default nutanix/4u password.
  • In my case I assigned the new IP to the ESXi host as well as the IPMI interface.
  • Under the gear icon, click Expand Cluster. The node you removed previously should be discovered.



  • Scroll down and enter the new IP.




    Note: At this point I ran into an error, but that was fixed by a 'genesis restart', after which I started the process again. The genesis.out log file should give you a good indication of what went wrong; see the sketch at the end of this post.
  • Select the required hypervisor and click Expand Cluster.



  • The process kicks off; in my case it took approx. 60 minutes to complete.



  • You should notice on the main PRISM page that you have an extra node and that your storage has increased.
  • Go to Storage > Storage Container. Select the applicable datastores and mount them to the new host. You can do this by selecting the datastore and clicking the Update link. Select Mount/Unmount on the following ESXi hosts and select the IP address of the new host.
  • The last thing to do on the Nutanix side is to update licensing. On the original cluster, go to the gear icon > Licensing. Click Update License and generate the cluster summary file.
  • In the support portal, select the cluster UUID and click the Actions button. Click Reclaim, upload the summary file, and click Generate.
  • Go back to the original cluster and upload the newly generated license file.
  • On the cluster where you added the node, click the licensing violation banner at the top and update the license. Generate the cluster summary file and go to the support portal once again. Select the correct cluster and choose Add Node under Actions.
  • Upload the cluster summary file and generate a new license file. On the cluster, apply the new license.
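
As promised in the note above, this is how I keep an eye on genesis while the expand runs; the log path is the standard one on a CVM:

    nutanix@cvm$ tail -f ~/data/logs/genesis.out   # watch genesis for errors during cluster expansion
    nutanix@cvm$ genesis restart                   # restart genesis on this CVM if the expand gets stuck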
And that is it! You can now add your node to vCenter and configure it as required.