Some time ago i was confronted with a VM that cause a lot of I/O. I was told to analyze the cause of this and find ways to calm this VM down since it consumes all the performance of our storage.
While analyzing the VM i saw the following characteristics:
- Every 30 minutes i see a IOPS burst up to 28.000. (This seems to be the maximum our storage could deliver.)
- This bursts last up to 2 minutes. After the burst the VM is almost silent with no more than 20 IOPS. This lasts until the next burst.
My first try was to limit the IOPS of the VM. Since today this is not possible with VMware if your VMs resides on NFS storage. (This is another story which i will hopefully blog about next time.)
So the next try was to put the VM on a dedicated datastore and limit that datastore to a maximum amount of IOPS. Again i see that every 30 minutes a burst of IOPS occoured. But because of the IOPS limit it lasts up to 20 minutes. So the first thing i learned is that limiting IOPS doesn´t reduce the requests.😉
Ok, now that i know that the VM really demands that amount of IOPS i started researching what´s inside that VM. I turns out that this VMs contains a windows operating system with an application for monitoring printers. (How much pages were printed, how much toner is left and so on.) Contacting the Vendor of this application was not useful since they can´t say anything about this high IOPS bursts. They could not even tell what process runs every 30 minutes which causes this high storage pressure. (OK, they programmed this shit. How could i assume they know what they did?)
So i asked some colleagues for help with this application. Together we found out that the application uses a Microsoft SQL Express instance for storing data. We could see that the sqlserver.exe process causes that high amount of IOPS every 30 minutes. After checking the parameters of the SQL server instance we found the cause of the problem. The SQL instance was configured to grow the SQL database file about 1 MB every time it needs more space. This small value caused a high fragmentation inside the SQL database file structure because there is never much free space inside the database file to write new objects in one piece. The most influenced part of this setting was the database index itself. Since the index was read every time this unknown job started every 30 minutes, the database server had to read a lot of blocks to simply find even one record.
We solved the problem by setting the database to grow the database file by 100 MB. This gives the SQL server enough free space inside the database file to store new objects in one piece. After that we recreated the database index new to have it in one piece inside the database file.
After that changes, the IOPS demands of the VM drops to to a few hundred IOPS every 30 Minutes for about a few seconds. With this solution we reduced the pressure on our storage a lot.