NetApp simulator: Prevent the root volume from filling up

Over the last few months I again had the pleasure of working with the NetApp simulator to test some configuration options. And as always, it broke down shortly after I installed it, with the usual full root volume problem. While searching for a fix, I stumbled over the same pages and forum discussions I had seen the last time. I couldn’t believe that the simulator comes with a built-in error. So I made a fresh install and tried to analyze what fills up the root volume and how it can be prevented.

I tracked down the cause of the problem and collected the fixes to repair it. Then I tested which settings are needed to prevent the error from appearing again.

The problem

I downloaded version 9.8 of the simulator and installed it. Then I did only the absolutely necessary basic configuration: I set the cluster name to “fullrootvolume” and configured the cluster IP, the node IP and the required network settings. I created no aggregates, SVMs or anything else. Then I let it run without touching it.

Only 12 hours later the root volume of the simulator had no space left. The cluster services and the cluster IP address were no longer available, and the VMware console showed messages like this:

[mgmtgwd.rootvolrec.low.space:EMERGENCY]: The root volume on node "fullrootvolume-01" is dangerously low on space. Less than 10 MB of free space remaining.
[callhome.root.vol.recovery.reqd:EMERGENCY]: Call home for ROOT VOLUME NOT WORKING PROPERLY: RECOVERY REQUIRED.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: VifMgr.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: Bcom.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: VLDB.
[rdb.recovery.failed:EMERGENCY]: Error: Unable to find a master. Unable to recover the local database of Data Replication Module: Crs.

The cause

Of course the root volume is full. But the question was: what fills up the root volume, and why?

To answer this, I walked through the steps I found on other pages and forum threads.

  • Delete volume snapshots
  • Delete aggregate snapshots

These steps only helped for a short time. Only hours later the freed space in the volume/aggregate was filled up again and the error recurred.

What I finally found were files named sktrace.log. They can be seen when entering the systemshell and looking into the /mroot/etc/log/mlog/ directory.

I could not really find a detailed description of these files from NetApp, but according to this KB article these log files rotate once a day with a maximum of 35 rotations. From what I could observe, these logs can grow up to a maximum of 100 MB each. Since I did not find a way to disable the generation of these logs, this means the root volume must provide an additional 3.5 GB of space for these log files.

Directly after deployment the root volume has a size of 855 MB with only around 40 MB free. This design inevitably leads to the full root volume error within a few hours.
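
The arithmetic behind these numbers can be sketched in a few lines of Python. The 35 rotations come from the KB article and the 100 MB per file is only an observed value, so the result is a rough estimate and not a documented limit. The function and constant names are made up for illustration.

```python
# Rough worst-case estimate of the space the sktrace.log files can occupy.
MAX_ROTATIONS = 35         # rotations kept, according to the KB article
MAX_LOG_SIZE_MB = 100      # largest file size observed (not a documented limit)
ROOT_VOLUME_FREE_MB = 40   # approximate free space in vol0 right after deployment


def worst_case_log_space_mb(rotations: int = MAX_ROTATIONS,
                            log_size_mb: int = MAX_LOG_SIZE_MB) -> int:
    """Space in MB needed if every rotated log reaches its maximum size."""
    return rotations * log_size_mb


def shortfall_mb(free_mb: int = ROOT_VOLUME_FREE_MB) -> int:
    """Space in MB that is missing when only free_mb is available."""
    return worst_case_log_space_mb() - free_mb


if __name__ == "__main__":
    needed = worst_case_log_space_mb()
    print(f"Worst-case log space: {needed} MB (about {needed / 1000:.1f} GB)")
    print(f"Missing on a fresh root volume: {shortfall_mb()} MB")
```

With the default values the logs alone need 3500 MB, which is far more than the roughly 40 MB that are free on a fresh root volume.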

The solution

Since there is no other way than to provide an additional 3.5 GB of space, the solution is really simple: add more disks to the root aggregate and expand the root volume directly after installing the NetApp simulator. I found out that growing the root volume to at least 5 GB provides enough space for a stable environment that lets the NetApp simulator run for more than a few hours or days.

The steps for doing this are:

  • Assigning disks to the node in order to use them
    storage disk assign -all -node local
  • Adding more disks to the root aggregate
    aggr add-disks -aggregate <NAME-OF-ROOT-AGGREGATE> -diskcount <NUMBER-OF-DISKS>
  • Expanding the root volume to at least 5 GB size
    volume modify -vserver <NODE-NAME> -volume vol0 -size 5GB
  • Increasing the memory of the simulator VM to 8 GB
    Otherwise the simulator gets stuck after a few days. (See here.)
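
If you rebuild the simulator often, the three clustershell commands above can be generated with the placeholders already filled in. The following Python sketch only builds the command strings; it does not connect to the simulator. The helper name and the example values (an aggregate called aggr0, 3 disks) are hypothetical and have to be replaced with the names from your own installation.

```python
def build_expand_commands(root_aggregate: str, node_name: str,
                          disk_count: int, size: str = "5GB") -> list[str]:
    """Return the clustershell commands from the list above with placeholders filled in."""
    if disk_count < 1:
        raise ValueError("disk_count must be at least 1")
    return [
        # Assign all disks to the node so they can be used
        "storage disk assign -all -node local",
        # Add disks to the root aggregate
        f"aggr add-disks -aggregate {root_aggregate} -diskcount {disk_count}",
        # Expand the root volume
        f"volume modify -vserver {node_name} -volume vol0 -size {size}",
    ]


if __name__ == "__main__":
    # "aggr0" is only a placeholder for the real name of the root aggregate.
    for command in build_expand_commands("aggr0", "fullrootvolume-01", 3):
        print(command)
```

The printed lines can then be pasted into the clustershell one after another.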

Tips and tricks

While working on this problem, I tried a lot of things to repair a broken simulator. So here is my list of things to do when things have already gone wrong.

Cluster IP and cluster management no longer works

Every time the root volume fills up to 100%, the cluster services go down, and with them the cluster LIF.

In that situation the only way to log on to the simulator is via SSH to the node IP or through the VMware console.

Checking the available free space

Volume from the clustershell

When the root volume was full and the cluster services were down, I had problems with most of the usual cDOT commands of the clustershell. For example, the volume show command gave me the error message: Error: "show" is not a recognized command

Aggregate from the clustershell

  • aggr show
    Shows the size and the free space of the aggregate.
  • aggr show-status
    Shows the disks in the aggregate.
  • aggr show-space
    Shows the metadata of the aggregate.

Stepping down to the nodeshell

Because the volumes could not be checked from the clustershell, you need to step down to the nodeshell with this command:

node run local

Volume from the nodeshell

  • df -h vol0
    Shows the size, free space and snapshot size of the root volume.
  • vol status -S
    Shows the metadata of the volume. 

Aggregate from the nodeshell

  • aggr status -S
    Shows the metadata of the aggregate.

Snapshot deletion. Useless but good to know how.

As I pointed out earlier, deleting snapshots only helps for a short period of time. But it might be of interest to know how to delete and disable snapshots.

Assuming that it is already too late because the root volume is full and the cluster services are down, you have to work on snapshots from the nodeshell. The following commands therefore have to be entered on the nodeshell.

Volume

  • snap list -V vol0
    Shows the current snapshots of the root volume.
  • snap delete -a vol0
    Deletes all snapshots of the root volume.
  • snap sched vol0 0 0 0
    Disables the scheduled creation of snapshots of the root volume.
  • snap reserve -V vol0 0
    Removes the snapshot reserve of the root volume.

Aggregate

  • snap list -A <NAME-OF-THE-ROOT-AGGREGATE>
    Shows the current snapshots of the root aggregate.
  • snap delete -a -A <NAME-OF-THE-ROOT-AGGREGATE>
    Deletes all snapshots of the root aggregate.
  • snap sched -A <NAME-OF-THE-ROOT-AGGREGATE> 0 0 0
    Disables the scheduled creation of snapshots of the root aggregate.
  • snap reserve -A <NAME-OF-THE-ROOT-AGGREGATE> 0
    Removes the snapshot reserve of the root aggregate.

A complete list of the snap commands on the nodeshell can be found here.

Deletion of sktrace.log files. Useless but good to know how.

Another way to free some space in the root volume is to delete log files. I found out that especially the files named sktrace.log in the directory /mroot/etc/log/mlog/ use a large amount of space.

To delete these files you have to step down a bit further and enter the systemshell. The steps are perfectly documented here. In short you have to:

  • Unlock the diag user -> security login unlock -username diag
  • Set a password for the diag user -> security login password -username diag
  • Switch to diagnostic privilege -> set -privilege diagnostic
  • Enter the systemshell -> systemshell -node local

Once you have reached the systemshell you can use unix commands to explore the file system structure. In order to delete the mentioned sktrace.log files you have to:

  • Go to the right directory -> cd /mroot/etc/log/mlog/
  • Show the sktrace.log files and their size -> ls -lah sktrace.log*
  • Delete them -> rm -f sktrace.log*
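
To see how much space the sktrace.log files take up in total, the sizes can also be added up with a short script. The following Python sketch is meant to be run against a copy of the log directory on a machine that has Python; it makes no assumption that Python is available in the systemshell itself. The function names are made up for illustration.

```python
from pathlib import Path


def sktrace_usage(log_dir: str) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for every sktrace.log* file, largest first."""
    files = [(path.name, path.stat().st_size)
             for path in Path(log_dir).glob("sktrace.log*")
             if path.is_file()]
    return sorted(files, key=lambda item: item[1], reverse=True)


def total_bytes(usage: list[tuple[str, int]]) -> int:
    """Sum of all file sizes in bytes."""
    return sum(size for _, size in usage)


if __name__ == "__main__":
    # Path to a local copy of /mroot/etc/log/mlog/ (adjust as needed).
    usage = sktrace_usage("./mlog")
    for name, size in usage:
        print(f"{size / (1024 * 1024):8.1f} MB  {name}")
    print(f"{total_bytes(usage) / (1024 * 1024):8.1f} MB  total")
```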

Additionally you can reduce the amount of data in these files by disabling the logging for certain components. This has to be done at the clustershell level with diagnostic privilege. -> debug sktrace tracepoint modify -node * -module * -enabled false

As already mentioned, this doesn’t really help. The problem here is that disabling the logging only works until the next reboot of the simulator. After a reboot the default values for logging are restored and the root volume fills up again. I couldn’t find a way to disable logging permanently.

Adding space to the root volume

To add space to the root volume we first need disks and a bigger root aggregate.

Assigning all disks to the node

  • If the clustershell still works, use -> storage disk assign -all -node local
  • Otherwise step down to the nodeshell and use -> disk assign all

Adding disks to the root aggregate

  • If the clustershell still works, use -> aggr add-disks -aggregate <NAME-OF-THE-ROOT-AGGREGATE> -diskcount 3
  • Otherwise step down to the nodeshell and use -> aggr add <NAME-OF-THE-ROOT-AGGREGATE> 3@1G
    (For a description of the aggr add command see here.)

Expanding the root volume to at least 5 GB size

  • If the clustershell still works, use -> volume modify -vserver <NODE-NAME> -volume vol0 -size 5GB
  • Otherwise step down to the nodeshell and use -> vol size vol0 5g
    (For a description of the vol size command see here.)

Repairing the cluster database

Providing enough free space in the root volume doesn’t repair the simulator immediately. The cluster services are still offline. The easiest way to start them again seems to be a reboot of the simulator. But this doesn’t help, because ONTAP considers the cluster database corrupt after the root volume has filled up to 100%. In this state ONTAP wants to replicate the cluster database from another node in the same cluster. Since the simulator is a one-node cluster, there is no other node.

You can view this at the clustershell with diagnostic privilege with the command -> system configuration recovery node mroot-state show

The output in this situation should look like this:

Node: fullrootvolume-01
Delay WAFL Event Clear: 0

WAFL: Normal
Event: 0x0
WAFL event bootarg is not set

RDB Recovery State: 5050055000
VLDB: RDB auto-recovery failed to find master (5)
Management: Healthy (0)
VifMgr: RDB auto-recovery failed to find master (5)
Bcom: RDB auto-recovery failed to find master (0)
Crs: RDB auto-recovery failed to find master (0)
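
When checking this state repeatedly, the affected databases can be picked out of the output with a few lines of Python. This is only a sketch and assumes the output has exactly the layout shown above; the function name is made up, and the sample text is copied from the output above.

```python
import re

SAMPLE_OUTPUT = """\
RDB Recovery State: 5050055000
VLDB: RDB auto-recovery failed to find master (5)
Management: Healthy (0)
VifMgr: RDB auto-recovery failed to find master (5)
Bcom: RDB auto-recovery failed to find master (0)
Crs: RDB auto-recovery failed to find master (0)
"""

# Matches lines such as "VLDB: RDB auto-recovery failed to find master (5)".
STATE_LINE = re.compile(r"^(?P<unit>\w+): (?P<state>.+) \((?P<code>\d+)\)$")


def unhealthy_units(output: str) -> list[str]:
    """Return the names of all databases whose state text is not 'Healthy'."""
    units = []
    for line in output.splitlines():
        match = STATE_LINE.match(line.strip())
        if match and match.group("state") != "Healthy":
            units.append(match.group("unit"))
    return units


if __name__ == "__main__":
    print(unhealthy_units(SAMPLE_OUTPUT))
    # prints ['VLDB', 'VifMgr', 'Bcom', 'Crs']
```

The sketch looks at the state text and not at the number in parentheses, because in the output above the number is 0 even for some of the failed databases.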

Since there is no other node in the cluster to replicate the database from, you have to tell the simulator node to treat its own database as healthy again. This can be done with the command -> system configuration recovery node mroot-state clear -recovery-state all

After this command a reboot of the simulator should repair everything.

I also found information on multiple websites about the boot loader variable bootarg.init.boot_recovery. These pages mention that setting or unsetting the variable helps to repair the situation. But on my simulator it never had any effect.


Changes

  • 2022-11-08 : Added newly found information about increasing the memory of the simulator VM.
