Tuesday, 20 October 2015

High CPU Contention in vROPS

Last week I was investigating a VM that was perceived to have CPU performance issues. As always I investigate the following metrics first to rule out contention:

  • CPU Ready
  • Co-Stop
  • Swap Wait 
  • I/O Wait
The values returned for these metrics were well within the acceptable range and thus did not indicate any contention. However, the CPU Contention metric showed a rather high value. I saw this pattern across most of the VMs too.


So what is the CPU Contention metric all about? From what I understand it is not a granular metric but a derived metric which allows you to quickly spot that the VM is suffering from CPU contention.
You can then inspect the individual metrics I mentioned above. Just to make sure, I double-checked the individual CPU metrics in vROPS, but nothing there was of concern. Obviously something is not quite right, so let's investigate the workload of our VM.


My VM is configured with 10 GHz of CPU and its demand, indicated by the green bar, is 5 GHz.
Usage, as indicated by the grey bar, is 2 GHz. Demand is what is requested and usage is what is delivered. Since the VM does not get the resources it requests, we have to conclude there is contention somewhere. As I really could not find anything that pointed to contention, I turned to Google and started seeing reports that this could be caused by CPU power management policies. Some people reported that they had this issue and that it was fixed by disabling power management.
Worth testing for myself...

The procedure will be different depending on your hardware. In my case it was HP hardware. The following VMware KB may come in handy.
Power management on ESXi can be managed from the host if the host BIOS supports OS control mode.
ESXi supports 4 power management policies:

  • High Performance
  • Low Power
  • Balanced
  • Custom
Other people reported that the High Performance policy fixed their issue. This effectively disables power management. You can change this setting without disruption, but you will need to reboot your host to ensure the setting is applied. Select your Host > Configuration > Hardware > Power Management and set the policy to High Performance.
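If you have many hosts, the same change could be scripted instead of clicked. The following is only a minimal sketch, assuming a pyVmomi session already exists and `host` is a host object from it. The helper names `find_policy_key` and `set_power_policy` are made up for illustration, and the short name "static" for High Performance is an assumption you should verify against what your own hosts report.

```python
def find_policy_key(available_policies, short_name):
    """Return the key of the policy whose shortName matches.

    `available_policies` is expected to look like the list a host exposes
    under configManager.powerSystem.capability.availablePolicy.
    """
    for policy in available_policies:
        if policy.shortName == short_name:
            return policy.key
    raise LookupError("no power policy with shortName %r" % short_name)


def set_power_policy(host, short_name="static"):
    """Apply a power policy to one host and return the key that was used.

    Assumption: "static" is the short name of High Performance
    ("dynamic" = Balanced, "low" = Low Power, "custom" = Custom).
    """
    power_system = host.configManager.powerSystem
    key = find_policy_key(power_system.capability.availablePolicy, short_name)
    power_system.ConfigurePowerPolicy(key=key)
    return key
```

Looking the key up by short name rather than hard-coding a number means the script fails loudly if a host does not offer the policy.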


In the case of HP hardware, and depending on the generation, you will need to set the power profile and power regulator too. Both of these options can be set in the BIOS, and the latter can also be set via the iLO interface. The power profile allows for 4 settings:

  • Balanced Power and Performance (Default)
  • Minimum Power Usage
  • Maximum Performance
  • Custom
I changed this to Custom as this allows the most flexibility and makes all options available.

The Power Regulator allows you to configure the following profiles:

  • Dynamic Power Savings Mode
  • Static Low Power Mode
  • Static High Performance Mode
  • OS Control Mode
This can be changed from either the BIOS or the iLO interface. If you change it in iLO you can do so at any time, but it will not take effect until you reboot.


I changed the option to OS Control Mode as it ensures that the processors run in their maximum power and performance mode unless you change the profile via the OS. Since we set the policy to High Performance in ESXi, we have now effectively disabled all power savings.

So has all this work actually made a difference? Let's check!
In vROPS we see that the change did indeed make a difference to our CPU Contention metric.


We determined that "contention = demand - usage". Looking at demand and usage under workload, we can see that these are now the same, which means that the VM is getting the resources it requests.
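To make that rule of thumb concrete, here is a small sketch using the numbers from the workload view before the change (5 GHz demand, 2 GHz usage). The function name and the percentage-of-demand figure are illustrative only. This is not how vROPS calculates its CPU Contention metric internally.

```python
def cpu_shortfall(demand_ghz, usage_ghz):
    """Return (shortfall in GHz, shortfall as a percentage of demand)."""
    if demand_ghz < 0 or usage_ghz < 0:
        raise ValueError("demand and usage must be non-negative")
    # Sampled usage can briefly exceed demand, so clamp the gap at zero.
    shortfall = max(demand_ghz - usage_ghz, 0.0)
    percent = shortfall * 100.0 / demand_ghz if demand_ghz > 0 else 0.0
    return shortfall, percent


# Before the change: 5 GHz requested, 2 GHz delivered.
print(cpu_shortfall(5.0, 2.0))  # (3.0, 60.0)
```

When demand and usage are equal, as they were after the change, the function returns a shortfall of zero.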



Looks like our contention is gone. Although there were really no complaints from users with regard to performance, one colleague found that one of his VMs was performing better after this change.

Although the focus of this post was on vROPS, there are other ways of determining whether there is contention. In esxtop I found no issue with the likes of %RDY and %CSTP, but I did notice that the %LAT_C value was high and that the VM wanted (%RUN) more resources than it received (%USED). This blog post by Michael Wilmsen does a very good job of explaining it in more detail.
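The esxtop pattern described above can be written down as a simple check. This is only a sketch of my reading of those counters: the function name is made up, and the 5% limits are placeholder assumptions rather than official thresholds, so adjust them to whatever you consider acceptable.

```python
def power_management_suspect(run_pct, used_pct, lat_c_pct, rdy_pct, cstp_pct,
                             lat_c_limit=5.0, wait_limit=5.0):
    """Flag the pattern: %RDY and %CSTP look fine, %LAT_C is high,
    and %USED trails %RUN.

    The limits are illustrative assumptions, not VMware guidance.
    """
    scheduler_ok = rdy_pct < wait_limit and cstp_pct < wait_limit
    latency_high = lat_c_pct >= lat_c_limit
    used_below_run = used_pct < run_pct
    return scheduler_ok and latency_high and used_below_run


# Made-up sample values, purely to show the call.
print(power_management_suspect(run_pct=80.0, used_pct=55.0,
                               lat_c_pct=12.0, rdy_pct=0.5, cstp_pct=0.0))
```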




