Over the last few months I again had the pleasure of working with the NetApp simulator to test some configuration options. And as always, it broke down shortly after I installed it, with the usual full root volume problem. While searching for a fix, I stumbled over the same pages and forum discussions I had seen the last time. I couldn’t believe that the simulator comes with a built-in error. Therefore I made a fresh install and tried to analyze what fills up the root volume and how it can be prevented.
I tracked down the cause of the problem and collected the fixes to repair it. Then I tested which settings have to be made to prevent the error from appearing again.
I downloaded version 9.8 of the simulator and installed it. Then I made only the absolutely necessary basic configuration: I set the cluster name to “fullrootvolume” and configured the cluster IP, the node IP, and the required network settings. I created no aggregates, SVMs, or anything else. Then I let it run without touching it.
Only 12 hours later, the root volume of the simulator had no space left. The cluster services and the cluster IP address were no longer available, and the VMware console showed messages like these:
[mgmtgwd.rootvolrec.low.space:EMERGENCY]: The root volume on node "fullrootvolume-01" is dangerously low on space. Less than 10 MB of free space remaining.
[callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: VifMgr.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: Bcom.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: VLDB.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: Crs.
Of course the root volume was full. But the question was: what fills up the root volume, and why?
To answer this, I walked through the steps I found on other pages and in forum threads:
- Delete volume snapshots
- Delete aggregate snapshots
These steps only helped for a short time. Just hours later, the freed space in the volume and aggregate was filled up again and the error reoccurred.
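For reference, the cleanup can be done with commands along these lines. This is a sketch: the placeholder names depend on your setup, and deleting all snapshots of vol0 with a wildcard will ask for confirmation.

volume snapshot delete -vserver <NODE-NAME> -volume vol0 -snapshot *
system node run -node local snap delete -A -a <NAME-OF-ROOT-AGGREGATE>

The second command drops into the nodeshell, because aggregate snapshots are managed with the 7-Mode-style snap commands there.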
What I finally found were files named sktrace.log. They can be seen when entering the systemshell and looking into the
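Entering the systemshell requires the diagnostic privilege level; from there, the standard BSD tools can be used to locate the files. A sketch, assuming the diag user is unlocked:

set -privilege diag
systemshell -node local
find / -name "sktrace*" 2>/dev/null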
I could not find a detailed description of these files from NetApp, but according to this KB article, the log files rotate once a day with a maximum of 35 rotations. From what I could observe, each of these logs can grow up to 100 MB. Since I did not find a way to disable the generation of these logs, the root volume must provide an additional 3.5 GB of space for these logfiles.
Directly after deployment, the root volume has a size of 855 MB with only around 40 MB free. This design inevitably leads to the full root volume error within a few hours.
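To see how close the root volume is to running full, its size and free space can be checked from the clustershell, for example like this (field names as in ONTAP 9):

volume show -vserver <NODE-NAME> -volume vol0 -fields size,available,percent-used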
Since there is no way around providing an additional 3.5 GB of space, the solution is really simple: add more disks to the root aggregate and expand the root volume directly after installing the NetApp simulator. I found that growing the root volume to at least 5 GB provides enough space for a stable environment, letting the NetApp simulator run for more than a few hours or days.
The steps for doing this are:
- Assigning disks to the node in order to use them
storage disk assign -all -node local
- Adding more disks to the root aggregate
aggr add-disks -aggregate <NAME-OF-ROOT-AGGREGATE> -diskcount <NUMBER-OF-DISKS>
- Expanding the root volume to at least 5 GB size
volume modify -vserver <NODE-NAME> -volume vol0 -size 5GB
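Afterwards, the result can be verified with something like the following (again a sketch with placeholder names):

storage aggregate show -aggregate <NAME-OF-ROOT-AGGREGATE> -fields size,availsize
volume show -vserver <NODE-NAME> -volume vol0 -fields size,available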
Tips and tricks
While working on this problem, I tried a lot of things to repair a broken simulator. So here is my list of things to do when things have already gone wrong.
Besides my own research, I found helpful information on the following pages: