Today I got complaints about the poor performance of a VM. I found that it had a high storage latency of about 150 ms. I also saw that all other VMs on the same datastore had the same high storage latency.
The interesting part was that I have a couple of other datastores located on the same storage array, in the same RAID set. The VMs on those datastores didn't have any latency problems. The performance graphs showed a storage latency of almost 0 for all of those VMs. So from the storage point of view there was no explanation for this weird behavior.
We have activated Storage I/O Control on all of our datastores. On the datastore with high storage latency we also have a VM with a configured IOPS limit. Because this VM generates the most traffic on that datastore, I set a lower IOPS limit. Shortly after that, the storage latency for all the VMs on that datastore rose. So I tried it the other way around and removed the IOPS limit from that one VM.
Then the magic happened. Storage latency for all VMs on that datastore dropped to almost 0. Of course this one VM then produced up to 23,000 IOPS without the limit, but luckily our storage is able to handle that number of requests.
What I don't understand is why this happens. All the documentation I found tells me that it is allowed to use Storage I/O Control together with per-VM IOPS limits, as seen here:
If someone has an idea about this, please contact me!
My personal assumption about this:
Storage I/O Control accounts for the I/O requests incorrectly! I think the per-VM IOPS limit causes high storage latency on that single VM. Storage I/O Control detects this and wrongly concludes that this one VM is not getting enough performance from the storage (ignoring the fact that the existing IOPS limit is what causes the latency). Storage I/O Control therefore slows down all other VMs on that datastore until all VMs have the same latency (because all VMs have the same share value).
Either I found a bug in the Storage I/O Control algorithm, or someone please prove me wrong…
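The first step of this assumption, that an IOPS limit alone can produce a high latency on the limited VM, can be checked with simple arithmetic. The sketch below uses Little's law (average latency = outstanding I/Os / throughput). The queue depth of 32, the limit of 200 IOPS and the 30 ms congestion threshold are assumed values for illustration only, not the real settings of my environment; only the 23,000 IOPS figure comes from the observation above.

```python
def throttled_latency_ms(outstanding_ios: int, iops: float) -> float:
    """Average latency by Little's law: outstanding I/Os / throughput."""
    return outstanding_ios / iops * 1000.0


OUTSTANDING_IOS = 32          # assumed guest queue depth (hypothetical)
IOPS_LIMIT = 200              # assumed per-VM IOPS limit (hypothetical)
UNLIMITED_IOPS = 23_000       # peak rate seen after removing the limit
CONGESTION_THRESHOLD_MS = 30  # assumed Storage I/O Control threshold

limited = throttled_latency_ms(OUTSTANDING_IOS, IOPS_LIMIT)
unlimited = throttled_latency_ms(OUTSTANDING_IOS, UNLIMITED_IOPS)

print(f"with limit:    {limited:.1f} ms")    # 160.0 ms
print(f"without limit: {unlimited:.1f} ms")  # 1.4 ms
print("exceeds threshold:", limited > CONGESTION_THRESHOLD_MS)
```

With these assumed numbers the limited VM alone would report a latency far above the threshold, although the array itself answers quickly. Whether Storage I/O Control really counts this throttle-induced latency as congestion is exactly the open question.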
11/17/2016 – Morning
VMware Support sent me a mail and pointed me to KB#2059192. I was told to switch to the old I/O scheduler since I do not use reservations. They also sent me a link to Duncan Epping's article about the mClock scheduler, telling me that I would find the information there to understand this behavior.
In his article Duncan refers to a PDF document about the internal algorithm of the mClock scheduler, which is not easy to read. As far as I understand that document, setting a per-VM IOPS limit should decrease the storage latency of all other VMs, not increase it. So why do I see the opposite?
I'm heading back to VMware Support for answers…
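To make my reading of the mClock document concrete, here is a small Python sketch of the idea as I understand it. It is a heavily simplified toy model and not VMware's implementation: reservations are left out, every VM is assumed to always have requests waiting, and the function name `simulate`, the VM names and the backend capacity of 1000 IOPS are made up for the example. Each VM carries a limit tag (the earliest time its next request may run) and a share tag (a virtual time that advances by 1/shares per request). In every time slot the scheduler picks, among the VMs whose limit tag has passed, the one with the smallest share tag.

```python
from typing import Dict, Optional


def simulate(limits: Dict[str, Optional[int]], shares: Dict[str, int],
             capacity_iops: int = 1000, duration_s: int = 1) -> Dict[str, int]:
    """Serve one request per time slot and count the requests per VM."""
    tick_us = 1_000_000 // capacity_iops   # length of one slot in microseconds
    limit_tag = {vm: 0 for vm in limits}   # earliest start of the next request
    share_tag = {vm: 0 for vm in limits}   # proportional-share virtual time
    served = {vm: 0 for vm in limits}

    for now in range(0, duration_s * 1_000_000, tick_us):
        # Only VMs whose limit tag has passed may be scheduled.
        eligible = [vm for vm in limits if limit_tag[vm] <= now]
        if not eligible:
            continue  # all VMs are held back by their limit; slot stays idle
        vm = min(eligible, key=lambda v: share_tag[v])
        served[vm] += 1
        if limits[vm] is not None:
            # Space requests of a limited VM at least 1/limit apart.
            limit_tag[vm] = max(limit_tag[vm] + 1_000_000 // limits[vm], now)
        share_tag[vm] = max(share_tag[vm] + 1_000_000 // shares[vm], now)
    return served


shares = {"vm_a": 1000, "vm_b": 1000}
print(simulate({"vm_a": None, "vm_b": None}, shares))  # {'vm_a': 500, 'vm_b': 500}
print(simulate({"vm_a": 100, "vm_b": None}, shares))   # {'vm_a': 100, 'vm_b': 900}
```

In this toy model the slots that the limited VM cannot use go to the other VM, so the other VM gets more throughput, and with a fixed number of outstanding I/Os that means lower latency, not higher. That matches my expectation and is the opposite of what I measured.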
11/17/2016 – Evening
Got a phone call from VMware Support. I explained my problem again and pointed out that, as far as I understand the description of the mClock algorithm, this should not happen. VMware Support had to admit that this is a known bug in the mClock algorithm. They are currently working on the problem. I was advised to keep an eye on KB#2059192. This KB article does not describe my exact problem, but I was told it has the same cause.
As soon as I can, I will change the I/O scheduler as advised and see if I get better results.