Dealing with SnapVault replication issues

Well, for the past two months I had a case open with NetApp to figure out a SnapVault replication issue we were seeing. The initial transfer of the SnapVault relationship would complete without a hiccup, and manual snapshot transfers also worked – only the scheduled, auto-created Snapshots wouldn’t replicate.

At first I (and NetApp support) thought this was an issue with SnapVault itself. However, after being away for the last four weeks, I looked at the issue with fresh eyes. After a short peek into the logs, I found the same thing I’d found when I first looked into this.

SnapVault would create the daily snapshot on the SnapVault Primary and start the replication. However, something (or someone – it wasn’t clear at that point) then created a FlexClone of the volume … and, just as when we first encountered this, I was kinda puzzled.

But then I decided (please don’t ask me what made me look there) to look at the logs of the NetApp Filer on our log server. As it turns out, back when I enabled syslogging to an external log server, I seem to have enabled debug logging … and it was great to have that! The log gave me at least a clue as to where that ghost snapshot was coming from.

Now, knowing which corner this issue originated from, it dawned on me that we’d had a similar issue before. A quick peek into TSM Manager and I knew I was on the right track. The daily system backup starts around 21:15, and our TSM backup includes the System State backup – which in turn utilizes VSS, which triggers the NetApp Snapshot!

After excluding the System State from the daily backup, the SnapVault replication worked without a hiccup. I ended up removing SnapDrive from the server in question, since we don’t really need it there. Snapshots of the boot LUN created from SnapDrive are gonna be inconsistent anyhow (doesn’t matter if I do ’em from SnapDrive or the NetApp CLI).
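For reference, the exclusion can be done in the client’s dsm.opt – this is a sketch, and the exact option syntax depends on the TSM client version:

```
* dsm.opt on the affected Windows server (sketch, not the original file)
* ALL-LOCAL backs up all local drives; the minus keeps the System State
* out of the daily incremental
DOMAIN ALL-LOCAL -SYSTEMSTATE
```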

That restored the default VSS provider, which enables TSM to back up the System State again.

TSM and NetApp – Another Quick Hint

Well, we’ve been trying to come up with a decent way to backup NetApp snapshots to tape (SnapMirror To Tape), so we evaluated all the available methods of using NDMP backups.

  1. There’s Image Backup in two different variants – FULL and DIFFerential
  2. There’s SnapMirror To Tape

So the Image Backup is one of the ways. However, the differential backup only works for CIFS and NFS shares (which we don’t use). We only have FC LUNs (or rather FCoE LUNs), so there’s only a single file (or, in the case of the boot LUNs, a few files) in each volume. Because of that, each run of the Image Backup with the differential option is gonna back up the full size of the volume (plus the deduplicated amount).

The SnapMirror To Tape option presents another problem: we intend to use SnapManager for SQL/Oracle, which creates “consistent” snapshots of the database LUNs. However, the SnapMirror To Tape backup doesn’t have an option to use an already existing snapshot – it creates another one. Which kicks the whole SnapManager business to the curb. So we either use SnapMirror To Disk to mirror the database LUN to a second controller and then run the SnapMirror To Tape from there, or come up with another way to back the databases up to TSM.
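The mirror-then-dump workaround would look roughly like this in 7-mode CLI terms – filer names, volume names and the tape device are placeholders, not our actual setup:

```
# 1) mirror the database volume to the second controller
filer2> snapmirror initialize -S filer1:dbvol filer2:dbvol_mirror

# 2) once the mirror is idle, dump it to tape from the second head
filer2> snapmirror store dbvol_mirror rst0a
```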

TSM and NetApp – Quick Hint

Well, to save everyone else the trouble (since it isn’t documented anywhere – and I just spent about an hour finding the cause): if you need to configure NDMP on your NetApp Filer, make sure you also configure an interface other than e0M.

Apparently the necessary control port for NDMP (10000) is blocked on e0M, so NDMP may be configured and running, yet TSM is gonna complain that it is unable to connect to the specified data mover.
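In 7-mode CLI terms that boils down to something like the following – e0a is a placeholder for whichever data interface you want NDMP to use:

```
filer> ndmpd on
filer> options ndmpd.preferred_interface e0a
filer> ndmpd status
```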

Doing TSM’s job on Windows Server 2008

Ran into another weird problem the other day … Had a few Windows boxes running out of space. Why? Because TSM includes a System State backup when creating the daily incremental. Now, apparently (as stated by IBM support) it isn’t TSM’s job to keep track of the VSS snapshots, but rather Windows’. By default, unless you set a limit in the Shadow Copies properties of a Windows drive, there is no limit on the volume – so VSS slowly eats up all your space.

That isn’t even the worst of it – it gets ugly when you want to delete it all … With Windows 2003 you would just do this:
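The original listing didn’t survive; the usual way on Windows 2003 is the vssadmin tool – the drive letter here is just an example:

```
C:\> vssadmin delete shadows /for=D: /all
```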

However, as with everything Microsoft, Windows 2008 R2 does it a little differently. As a matter of fact, it won’t allow you to delete application-triggered snapshots (as you can see in the example below), so you’re basically shit-out-of-luck.

Well, not really … diskshadow to the rescue. Simply running diskshadow with a small script like this does the trick:
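The original script is missing from the post; a minimal diskshadow script for this purpose (saved to a file and run via `diskshadow /s <file>`) would be just:

```
delete shadows all
exit
```

Unlike vssadmin, diskshadow’s `delete shadows all` also removes application-created shadow copies.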

Just for clarification: this isn’t my own work, it was someone else’s.

TSM Client: Service Script for Solaris 10

Today I’ve been fighting with Solaris 10 and an SMF manifest (others would call it an init script …). Since I wanted to do it the proper way (I could have used an “old-style” init script, but I didn’t wanna …), I ended up combing the interwebs for examples … As it turns out, not even IBM has documented how to do this.

In the end this is what I’ve come up with:

However, in order to get the scheduler client working on Solaris, I had to create a little helper script in /opt/tivoli/tsm/client/ba/bin named dsmc.helper:
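The helper listing itself is missing; the shape of such a wrapper is roughly the following – paths and the scheduler invocation are assumptions, since dsmc wants to be started from its own directory with its environment set before SMF hands control to it:

```
#!/bin/sh
# dsmc.helper - sketch, not the original script.
# Sets up the TSM client environment and starts the scheduler.
DSM_DIR=/opt/tivoli/tsm/client/ba/bin
DSM_CONFIG=$DSM_DIR/dsm.opt
export DSM_DIR DSM_CONFIG

cd "$DSM_DIR" || exit 1
exec ./dsmc schedule "$@"
```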

With that, I was able to automate the TSM Scheduler Client startup on Solaris.

VMware Consolidated Backup and TRANSPORT_MODE=”hotadd”

As the title says, I’ve been playing with vCB (inside a VM) and the TSM integration in newer (>6.0) clients for work. The result of all this work should be a feasibility study. We’re currently thinking about replacing our VMware server(s) with ESXi. But as most of you know, on ESXi you simply can’t install anything (well, you can … in ~100KB of disk space, which is nothing compared to a TSM client weighing roughly 120MB!). As we would like the option to back up VMs at image level, I went looking at solutions.

  1. VMware Data Recovery
  2. VMware Consolidated Backup
  3. vRanger, ……

As I was looking for something that wouldn’t cost us any money (which excludes the third), I took a look at vDR and vCB. One point I do have to give to vDR: it’s damn fast. The only bad thing about vDR is that it doesn’t integrate with TSM at all, and it ain’t supported to install a TSM client inside the vDR VM. So vDR was also done for.

The only remaining option was vCB. I remember way back when TSM didn’t support vCB directly, at which time it was *quite* the hassle to configure. But with newer TSM clients (as in the newer 6.x ones), IBM decided to integrate support for it, which makes setting things up quite easy. Or so you’d think.

Since I wanted to use “hotadd” as the transport mode for the vmdk’s (which basically means creating a snapshot of the vmdk and attaching that snapshot to the vCB VM), I did have to tinker around with some JavaScript files in %ProgramFiles%\VMware\VMware Consolidated Backup. Sure, it isn’t supported by VMware (which is a bit lame, since they announced the EOL of vCB with the upcoming vSphere version), but I didn’t want to open a support request. I’m lazy, yep:

Change DEFAULT_TRANSPORT_MODE in utils.js from “san” to “hotadd”. Apparently, though, this only fixes the transport for vmdk-level backups, not for file-level ones. File-level backups are still gonna use nbd (network block device), which kinda sucks since that backup goes out over the network.

After doing that, hotadd mode is still gonna fail, since apparently the denoted “VMware Consolidated Backup User” (vcb-user in my case) also needs permissions on the datastore. The permissions the handbook sets for the user are okay – you just need to apply that role to the datastore(s) containing the VMs you want to back up, too! Otherwise vcbMounter fails with a rather cryptic error telling you that it doesn’t have sufficient rights to create a linked clone.

Converting TIVSM RPMs to deb

We received a preinstalled customer server the other day, for which we had declared “as-is” support only, since it is running Lucid Lynx. Today I started getting the TSM client to work. It was kinda weird, since at first dsmc was reporting something like this:

# ./dsmc: no such file or directory

After fiddling with it a bit more, here are the control files, as well as the prerm and postinst scripts for TIVSM-API, TIVSM-API64 and TIVSM-BA:

tivsm-api/debian/control:

tivsm-api/debian/tivsm-api.postinst:

tivsm-api/debian/tivsm-api.prerm:

tivsm-api64/debian/control:

tivsm-api64/debian/postinst:

tivsm-api64/debian/prerm:

tivsm-ba/debian/control:

tivsm-ba/debian/tivsm-ba.postinst:

tivsm-ba/debian/tivsm-ba.prerm:

All that was left to do was add a -n to the dh_makeshlibs call in each package’s debian/rules file; otherwise dh_makeshlibs would overwrite my shiny postinst/prerm actions!
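The relevant line in each debian/rules then looks like this (-n is debhelper’s “noscripts” switch, which stops dh_makeshlibs from injecting its own maintainer-script snippets):

```
	dh_makeshlibs -n
```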

TS7530 authentication failure

Today I had a rather troublesome morning. Once I got to work, Nagios was already complaining about lin_taped on one of our TSM servers, which apparently failed due to too many SCSI resets. Additionally, I couldn’t log in using the VE console (I could log in via SSH, however), so I ended up opening an IBM Electronic Service Call (ESC+).

Using SSH, I could get some information on the VE’s status:

After looking a bit deeper, it seemed that neither of the two TSM servers was able to see the IBMchanger devices for the first VTL. The second one was perfectly visible, just not the first. After putting both VE nodes into suspended failover and gathering support data from both VEs and the Brocade SAN switches for IBM support, everything apparently works again. I guess the library does have “self-healing” properties.

OCF agent for Tivoli Storage Manager: redux

Well, after I finished my first OCF agent back in October 2008, we’ve now had it running in production for about ten months. During that time, we found quite a few points where we’d like to improve how Linux-HA handles TSM:

  • Shut down TSM nicely if possible (cancel client sessions, cancel running processes and dismount mounted volumes)
  • Better error handling

So, after another week of writing and testing with a small instance, I present the new OCF agent for Tivoli Storage Manager. It still has one or two weak points, but they are negligible. I still need to write the documentation for it, but the script should just work …
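As a rough illustration of the agent’s shape (not the actual script – the process check and the start/stop bodies here are placeholders), an OCF resource agent for TSM boils down to a dispatcher over the standard actions with OCF return codes:

```shell
#!/bin/sh
# Skeleton of an OCF resource agent for TSM (illustrative only; the real
# agent additionally drives dsmadmc to cancel sessions/processes and
# dismount volumes before stopping). Return codes follow the OCF spec.
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

tsm_status() {
    # hypothetical liveness check: is a dsmserv process running?
    pgrep -f '[d]smserv' >/dev/null 2>&1
}

tsm_agent() {
    case "$1" in
        start)
            # start the server instance unless it is already running
            tsm_status || echo "starting dsmserv (placeholder)"
            return $OCF_SUCCESS ;;
        stop)
            # graceful path first: cancel sessions, dismount volumes,
            # halt the instance - then escalate if it will not die
            return $OCF_SUCCESS ;;
        monitor)
            if tsm_status; then
                return $OCF_SUCCESS
            else
                return $OCF_NOT_RUNNING
            fi ;;
        *)
            return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}

# dispatch only when invoked with an action argument
[ $# -eq 0 ] || tsm_agent "$@"
```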


Weird TS3500 problem: redux

Well, after yesterday’s episode with our tape library, today continued to be a taxing day. After restarting a few exports that had been hanging since yesterday due to our library problems, something similar happened again: TSM was unable to locate a few (two, to be exact) tapes in the library.

Yet the library reported the tapes as still inventoried. *shrug* So there we were again, looking completely baffled. After a short while of trying to figure out what to do, we went through the Data Cartridge inventory again. As it turns out, between putting the library into “Pause” mode and restarting TSM multiple times, TSM apparently completely forgot that it had put these tapes into drives.

After manually moving the tapes back to their home slots via the management interface of the TS3500 and setting the volume access mode back to read-write, everything is fine now and I could finish my pending exports!