Sunday 28 February 2016

Check your ring size. Network loss after VM migration

Last week my colleague was patching the ESX fleet from build 3029944 to build 3343343. Nothing new: it has been done many times before and you know it works. Unfortunately it did not go without hassle this time around. After placing a host in maintenance mode the VMs migrated as expected, however a few VMs lost their connectivity to the network. Early investigation indicated that it was very random. Some of the VMs placed on a host with build 3029944 had no issues when they arrived on a host with the new build 3343343. When looking at the common denominators we started to notice that the affected VMs were mostly running Windows 2008 R2 and part of the same applications, in this case Microsoft Lync and some of the Exchange environment. We did not notice issues with Linux builds. Nothing at the ESX layer indicated that there were issues, although the affected VM's port on the vDS appeared down.

Concentrating on the OS level we noticed that there were issues with the vmxnet3 driver in Device Manager. The error message returned was: This device cannot start (Code 10). A couple of workarounds are uninstalling the device and rescanning, or removing the vNIC from the VM and adding a new one. Although this fixed the issue we still had no idea what caused it. In the meantime VMware support was engaged, but nothing was obvious beyond what we had already discovered. Eventually the problem was found, and it turned out to be a known bug with build 3343343, as the engineer discovered in an internally published KB.

The issue was caused by the fact that some of our VMs had their advanced vmxnet3 NIC settings altered, in particular the ring size #2 setting. This setting had been changed for certain applications, such as Lync and Exchange, by our networking architect after experiencing packet loss. You may want to have a look at KB2039495.


As part of the recommendations the setting was set to 4096, and this is apparently what triggers the bug. The recommendation is 2048, and indeed there is no loss of connectivity after that change. This also explains why replacing the vNIC worked: the new vNIC comes with default settings, which wipes out the custom ring size.
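
If you want to check your own fleet before patching, something along these lines could help. This is only a rough sketch: it assumes you have already exported the ring #2 size of each vmxnet3 adapter to CSV by whatever means you have available. The column names (vm, adapter, rx_ring2_size), the VM names and the sample values are all made up for illustration. The only thing taken from the story above is the threshold: 4096 triggered the problem for us and 2048 did not.

```python
import csv
import io

# Threshold from the post: 4096 triggered the bug, 2048 did not.
SAFE_RX_RING2_SIZE = 2048


def find_risky_vms(csv_text, limit=SAFE_RX_RING2_SIZE):
    """Return (vm, adapter, size) tuples whose ring #2 size exceeds limit."""
    risky = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        size = int(row["rx_ring2_size"])
        if size > limit:
            risky.append((row["vm"], row["adapter"], size))
    return risky


# Hypothetical inventory export. VM names, column names and values are
# invented for this example and do not come from a real environment.
SAMPLE = """vm,adapter,rx_ring2_size
lync-fe01,vmxnet3 Ethernet Adapter,4096
exch-mbx01,vmxnet3 Ethernet Adapter,2048
web01,vmxnet3 Ethernet Adapter,32
"""

if __name__ == "__main__":
    for vm, adapter, size in find_risky_vms(SAMPLE):
        print(f"{vm}: {adapter} has ring #2 size {size}, "
              f"consider lowering it to {SAFE_RX_RING2_SIZE}")
```

With the sample data only lync-fe01 is flagged, since it is the only adapter set above 2048.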