Over the last few months I once again had the pleasure of working with the NetApp simulator to test some configuration options. And as always, it broke down shortly after I installed it, with the usual full root volume problem. While searching for a fix, I stumbled over the same pages and forum discussions I had seen the last time. I couldn’t believe that the simulator comes with a built-in error. So I made a fresh install and tried to analyze what fills up the root volume and how it can be prevented.
I tracked down the cause of the problem and collected the fixes to repair it. Then I tested which settings have to be made to prevent the error from appearing again.
The problem
I downloaded version 9.8 of the simulator and installed it. Then I did only the absolutely necessary basic configuration: I set the cluster name to “fullrootvolume” and configured the cluster IP, node IP and the needed network settings. I created no aggregate, no SVM, nothing else. Then I let it run without touching it.
Only 12 hours later the root volume of the simulator had no space left. The cluster services and the cluster IP address were no longer available and the VMware console showed messages like these:
[mgmtgwd.rootvolrec.low.space:EMERGENCY]: The root volume on node "fullrootvolume-01" is dangerously low on space. Less than 10 MB of free space remaining.
[callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: VifMgr.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: Bcom.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: VLDB.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: Crs.
The cause
Of course the root volume was full. But the question was: what fills up the root volume, and why?
To answer this, I walked through the steps I found on other pages and forum threads:
- Delete volume snapshots
- Delete aggregate snapshots
These steps only helped for a short time. Only hours later the freed space in the volume/aggregate was filled up and the error occurred again.
What I finally found were files named sktrace.log. They can be seen by entering the systemshell and looking into the /mroot/etc/log/mlog/ directory.
I could not find a detailed description of these files from NetApp, but according to this KB article these log files rotate once a day with a maximum of 35 rotations. From what I could observe, each of these log files can grow to about 100 MB. Since I did not find a way to disable the generation of these logs, the root volume must provide an additional 3.5 GB of space for them.
Directly after deployment the root volume has a size of 855 MB with only around 40 MB free. This design inevitably leads to the full root volume error within a few hours.
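To put some numbers on it: 35 rotations of roughly 100 MB each add up to about 3.5 GB of log data, while the freshly deployed root volume offers only around 40 MB of headroom. Even a single sktrace.log file can be more than twice the available free space.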
The solution
Since there is no other way than to provide an additional 3.5 GB of space, the solution is really simple: add more disks to the root aggregate and expand the root volume directly after installing the NetApp simulator. I found that growing the root volume to at least 5 GB provides enough space for a stable environment that keeps the simulator running for more than a few hours or days.
The steps for doing this are (a combined example follows the list):
- Assigning disks to the node in order to use them
storage disk assign -all -node local
- Adding more disks to the root aggregate
aggr add-disks -aggregate <NAME-OF-ROOT-AGGREGATE> -diskcount <NUMBER-OF-DISKS>
- Expanding the root volume to at least 5 GB size
volume modify -vserver <NODE-NAME> -volume vol0 -size 5GB
- Increasing the memory of the simulator VM to 8 GB
Otherwise the simulator gets stuck after a few days. (See here.)
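Putting these steps together, a session on a freshly installed simulator could look like the following sketch. The aggregate name aggr0_fullrootvolume_01 and the node name fullrootvolume-01 are just assumptions from my test setup; use the names that aggr show and your cluster setup report, and add more than the 3 disks shown here if the aggregate turns out to be too small for a 5 GB root volume.
storage disk assign -all -node local
aggr add-disks -aggregate aggr0_fullrootvolume_01 -diskcount 3
volume modify -vserver fullrootvolume-01 -volume vol0 -size 5GB
The memory increase to 8 GB is done in the settings of the VM itself, not from within ONTAP.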
Tips and tricks
While working on this problem, I tried a lot of things to repair a broken simulator. So here is my list of things to do when things have already gone wrong.
Cluster IP and cluster management no longer work
Every time the root volume fills up to 100%, the cluster services go down and with them the cluster LIF.
In that situation the only way to log on to the simulator is via SSH to the node IP or via the VMware console.
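For example (admin and the placeholder IP are just the values of a default setup; use whatever you configured during the node setup):
ssh admin@<NODE-MGMT-IP>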
Checking the available free space
Volume from the clustershell
When the root volume is full and the cluster services are down, I had problems with most of the usual cDOT commands in the clustershell. For example, the volume show command gave me the error message Error: "show" is not a recognized command.
Aggregate from the clustershell
aggr show
Shows the size and the free space of the aggregate.
aggr show-status
Shows the disks in the aggregate.
aggr show-space
Shows the metadata of the aggregate.
Stepping down to the nodeshell
Because the volumes cannot be checked from the clustershell in this state, you need to step down to the nodeshell with this command:
node run local
Volume from the nodeshell
df -h vol0
Shows the size, free space and snapshot usage of the root volume.
vol status -S
Shows the metadata of the volume.
Aggregate from the nodeshell
aggr status -S
Shows the metadata of the aggregate.
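Taken together, a quick space check from the nodeshell could look like this sketch, entered after logging on via the node IP:
node run local
df -h vol0
vol status -S
aggr status -S
exit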
Snapshot deletion. Useless but good to know how.
As I pointed out earlier, deleting snapshots only helps for a short period of time. But it might still be of interest how to delete and disable snapshots.
Assuming that it is already too late because the root volume is full and the cluster services are down, you have to work on the snapshots from the nodeshell. Therefore the following commands have to be entered in the nodeshell; a combined example follows at the end of this section.
Volume
snap list -V vol0
Shows the current snapshots of the root volume.
snap delete -a vol0
Deletes all snapshots of the root volume.
snap sched vol0 0 0 0
Disables the scheduled creation of snapshots of the root volume.
snap reserve -V vol0 0
Removes the snapshot reserve of the root volume.
Aggregate
snap list -A <NAME-OF-THE-ROOT-AGGREGATE>
Shows the current snapshots of the root aggregate.
snap delete -a -A <NAME-OF-THE-ROOT-AGGREGATE>
Deletes all snapshots of the root aggregate.
snap sched -A <NAME-OF-THE-ROOT-AGGREGATE> 0 0 0
Disables the scheduled creation of snapshots of the root aggregate.
snap reserve -A <NAME-OF-THE-ROOT-AGGREGATE> 0
Removes the snapshot reserve of the root aggregate.
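For reference, a complete snapshot cleanup from the nodeshell could look like the following sketch. The aggregate name aggr0_fullrootvolume_01 is an assumption from my test setup; use the name shown by aggr status on your node.
node run local
snap list -V vol0
snap delete -a vol0
snap sched vol0 0 0 0
snap reserve -V vol0 0
snap list -A aggr0_fullrootvolume_01
snap delete -a -A aggr0_fullrootvolume_01
snap sched -A aggr0_fullrootvolume_01 0 0 0
snap reserve -A aggr0_fullrootvolume_01 0
exit
As said above, this only buys a few hours before the sktrace.log files fill the freed space again.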
A complete list of the snap commands available in the nodeshell can be found here.
Deletion of sktrace.log files. Useless but good to know how.
Another way to free some space in the root volume is the deletion of log files. I found that especially the files named sktrace.log in the directory /mroot/etc/log/mlog/ use a large amount of space.
To delete these files you have to step down a bit further and enter the systemshell. The steps are perfectly documented here. In short you have to:
- Unlock the diag user ->
security login unlock -username diag
- Set a password for the diag user ->
security login password -username diag
- Switch to diagnostic privilege ->
set -privilege diagnostic
- Enter the systemshell ->
systemshell -node local
Once you have reached the systemshell, you can use Unix commands to explore the file system. To delete the mentioned sktrace.log files you have to (a combined example follows the list):
- Go to the right directory ->
cd /mroot/etc/log/mlog/
- Show the sktrace.log files and their size ->
ls -lah sktrace.log*
- Delete them ->
rm -f sktrace.log*
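The whole sequence, from the clustershell down to deleting the files, could look like this sketch (the password for the diag user is whatever you set in the second step; exit brings you back to the clustershell):
security login unlock -username diag
security login password -username diag
set -privilege diagnostic
systemshell -node local
cd /mroot/etc/log/mlog/
ls -lah sktrace.log*
rm -f sktrace.log*
exit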
Additionally, you can reduce the amount of data in these files by disabling the logging for certain components. This has to be done at the clustershell level with diagnostic privilege -> debug sktrace tracepoint modify -node * -module * -enabled false
As already mentioned, this doesn’t really help. The problem is that disabling the logging only works until the next reboot of the simulator. After a reboot the default values for logging are restored and the root volume fills up again. I couldn’t find a way to disable the logging permanently.
Adding space to the root volume
To add space to the root volume we first need disks and a bigger root aggregate.
Assigning all disks to the node
- If the clustershell still works, use ->
storage disk assign -all -node local
- Otherwise step down to the nodeshell and use ->
disk assign all
Adding disks to the root aggregate
- If the clustershell still works, use ->
aggr add-disks -aggregate <NAME-OF-THE-ROOT-AGGREGATE> -diskcount 3
- Otherwise step down to the nodeshell and use ->
aggr add <NAME-OF-THE-ROOT-AGGREGATE> 3@1G
(For a description of the aggr add command see here.)
Expanding the root volume to at least 5 GB size
- If the clustershell still works, use ->
volume modify -vserver <NODE-NAME> -volume vol0 -size 5GB
- Otherwise step down to the nodeshell and use ->
vol size vol0 5g
(For a description of the vol size command see here.)
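With a dead clustershell, the whole expansion can be done from the nodeshell in one go. Again, the aggregate name aggr0_fullrootvolume_01 is only an assumption from my setup:
node run local
disk assign all
aggr add aggr0_fullrootvolume_01 3@1G
vol size vol0 5g
exit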
Repairing the cluster database
Providing enough free space in the root volume doesn’t repair the simulator immediately; the cluster services are still offline. The easiest way to start them again seems to be a reboot of the simulator. But this doesn’t help, because ONTAP considers the cluster database corrupt after the root volume filled up to 100%. In this state ONTAP wants to replicate the cluster database from another node in the same cluster. Since the simulator is a single-node cluster, there is no other node.
You can check this in the clustershell with diagnostic privilege using the command -> system configuration recovery node mroot-state show
The output in this situation should look like this:
Node: fullrootvolume-01
Delay WAFL Event Clear: 0
WAFL: Normal
Event: 0x0
WAFL event bootarg is not set
RDB Recovery State: 5050055000
VLDB: RDB auto-recovery failed to find master (5)
Management: Healthy (0)
VifMgr: RDB auto-recovery failed to find master (5)
Bcom: RDB auto-recovery failed to find master (0)
Crs: RDB auto-recovery failed to find master (0)
Since there is no other node in the cluster to replicate the database from, you have to tell the simulator node to treat its own database as healthy again. This can be done with the command -> system configuration recovery node mroot-state clear -recovery-state all
After this command a reboot of the simulator should repair everything.
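In summary, the recovery could look like this sketch, run from the clustershell (the reboot can of course also be done from the VMware console instead of the system node reboot command):
set -privilege diagnostic
system configuration recovery node mroot-state show
system configuration recovery node mroot-state clear -recovery-state all
system node reboot -node local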
I also found information on multiple websites about the boot loader variable bootarg.init.boot_recovery. These pages mention that setting or unsetting this variable helps to repair the situation. But on my simulator it never had any effect.
Links
Besides my own research, I found helpful information on the following pages:
Changes
- 2022-11-08 : Added newly found information about increasing the memory of the simulator VM.