Dealing with SnapVault replication issues

Well, for the past two months I had a case open with NetApp to figure out this SnapVault replication issue we were seeing. The initial transfer of the SnapVault relation would complete with a hick up, manual snapshot transfers also work – just the scheduled, auto-created Snapshots won’t replicate.

At first I (and the NetApp support) thought this was an issue with SnapVault itself, however after being away for the last four weeks I looked at the issue with fresh eyes. After a short peek into the logs, I found what I had found back when I first looked into this.

SnapVault would create the daily snapshot on the SnapVault Primary and start the replication. However something (or someone, wasn’t clear at this point) then created a FlexClone of a volume … And as, back when we first encountered this, I was kinda puzzled.

But then I decided (please don’t ask me what made me look there) to look at the logs of the NetApp Filer on our logserver. As it turns out, back when I enabled syslogging to an external logserver I seem to have enabled debug logging … and it was great to have that! Below you’ll find the log I found – and as you can see there’s at least a clue as to from where that ghost snapshot is coming from.

Now, with knowing from which corner this issue originated it dawned on me, we have had a similar issue before. A quick peek into TSM Manager and I knew I was on the right track. The daily system backup starts around 21:15. Now our TSM backup includes the System State backup (which in turn utilizes VSS – which triggers the NetApp Snapshot!).

After excluding the System State from the Daily Backup the SnapVault stuff worked without a hickup. I ended up removing SnapDrive from the Server in question, since we don’t really need it there. Snapshots created from SnapDrive of the boot lun are gonna be inconsistent anyhow (doesn’t matter if I do ’em from SnapDrive or the NetApp CLI).

That restored the default VSS handler, which enables TSM to backup the System State again.

NetApp: SnapVault snapshot retention for non-standard snapshot names

Well, the name says it pretty much. Once you rename the snapshot on the SnapVault destination from daily.0 to something else, the whole builtin SnapVault snapshot retention isn’t gonna work anymore.

Back when I started all the code-writing, I wasn’t aware of this. One of my co-worker complained to me about it on Wednesday that there are an assfull of snapshots on the SnapVault destination (one snapshot each day since the end of October, meaning more than 50 snapshots per volume, in a total of 12 or so FlexVolumes, making the total about 500 snapshots).

So I took the time to write this little Bash script (yeah, I know I’m mixing a bunch of languages – I really like the KISS principle), which will get the necessary information from the filer (snapvault snap sched needs to be set) and then deletes the over-aged snapshots.

Read More

NetApp: Monitoring of SnapVault/SnapMirror/LUN/Snapshot information with Nagios

As I wrote before, we have a bunch of filers (and a ton of volumes w/ luns on them), that I need to monitor. At first, I tried the existing NetApp Nagios-Plugin(s), but they all use SNMP and with that I can either watch all volumes or none. And that didn’t satisfy me.

Don’t get me wrong, the existing plugins are okay and I still use them for stuff (like GLOBALSTATUS or FAN/CPU/POWER) which isn’t present in the API or real hard to get at, however I wanted more. So I ended up looking at the NetApp API, and ended up writing a “short” plugin for Nagios using Perl.

Maybe if I’m ever bored, I’ll rewrite it using C, but for now the Perl plugin has to suffice.

So far the plugin supports the following things:

  • Monitoring FlexVolumes (simply watching the free space)
  • Monitoring LUN space (the allocated space inside a FlexVolume for iSCSI/FC LUNs)
  • Monitoring Snapshot space (the allocated space inside a FlexVolume for Snapshots)
  • Monitoring SnapVault relations (and their age)
  • Monitoring SnapMirror relations (and their age)

The plugin will return performance data for most (if not all) of those classes. It needs a user on the filer you wish to monitor – which sadly needs to have the admin role.

Read More

NetApp: Archive SnapManager SQL snapshots

As I wrote before, we’re using SnapManager (for SQL/Oracle) to create consistent snapshots. However my database guys don’t want to name their snapshots daily.<increment> (which I can understand), as once you archive those snapshots to a secondary (and tertiary) system, the names become junk.

So, they’re naming the snapshots like snap__vcsrv_29_12_2012-10.00.01. Sadly, when it comes to SnapVault, it expects the names in form of daily.<increment> otherwise you won’t be able to transfer the snapshots with the CLI (none that I have found anyway).

But we didn’t want to move away from naming the snapshots the way they are, so I ended up writing a PowerShell script, that once triggered archives the Snapshots needed for a set of databases. It took me a while to figure a bunch of stuff out, but in the end I think I have a working way of archiving custom-named snapshots.

Read More

Implementing SnapVault backups – the hard way

Well, I recently had the pleasant task of implementing SnapVault backups, that are being shipped to an offsite location with SnapMirror.

That in itself isn’t the bad thing, however we decided against Protection Manager (since it was a charged product back when we decided on this). So I basically had the three tasks:

  1. Actually implement the SnapVault stuff (and learn my way around it and also document it)
  2. Write a bunch of scripts, that help us in creating scheduled backups of our databases
  3. Create a monitoring script, that’ll fit into our Nagios environment already in place

Well, two months later (sadly it still has some kinks – I can’t figure out this one bug though for the life of it) and a few hundred hours of working on/with it and out came four things:

  1. Bash-scripts to create the SnapVault/SnapMirror relations
  2. Powershell scripts to trigger the SnapVault updates
  3. a Nagios plugin, based on NetApp’s SDK for Data ONTAP (even if the API is crap from time to time – it’s still better than using SNMP)

I’ll post those things one after another, once I wrote up all the articles.