OCF agent for Tivoli Storage Manager: redux

Well, after I finished my first OCF agent back in October 2008, it has now been running in production for about ten months. During that time, we found quite a few points where we’d like to improve how Linux-HA handles TSM:

  • Shut down TSM gracefully if possible (cancel client sessions, cancel running processes and dismount mounted volumes) – a rough sketch of this follows below
  • Better error handling

So, after another week of writing and testing with a small instance, I present the new OCF agent for Tivoli Storage Manager. It still has one or two weak points, but they are negligible. I still need to write the documentation for it, but the script should just work …
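For the “shut down gracefully” point above, the core idea is to talk to the running instance through the administrative client before touching the process itself. A minimal sketch, assuming admin credentials in the environment; the -dataonly output parsing is crude and not the agent’s actual code:

    DSMADMC="dsmadmc -id=${ADMIN_USER} -password=${ADMIN_PASS} -dataonly=yes"

    # 1. throw out all client sessions
    $DSMADMC "cancel session all"

    # 2. cancel every running server process
    $DSMADMC "query process" | awk '/^[0-9]/ { print $1 }' | while read -r proc; do
        $DSMADMC "cancel process ${proc}"
    done

    # 3. dismount every mounted volume (crude parsing of 'query mount' output)
    $DSMADMC "query mount" | awk '/volume/ { print $4 }' | while read -r vol; do
        $DSMADMC "dismount volume ${vol}"
    done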

Read More

TSM: Restoring the database/recovery log to a point-in-time

Well, my co-worker just called on my cell (it’s Friday, 16:00), and asked me which start-up script he needed to change in order to restore the database. My first response was, “ummm, that’s gonna be hard, we’re using heartbeat”.

Okay, so after a bit of asking I got out of him what he wanted to achieve by changing the start-up script. Apparently he did something to crash Tivoli Storage Manager (or rather, crash it repeatedly) and wanted to restore the database. He had talked to one of the systems partners we have (and I’m happy we have them, most of the time), who in turn told him how to do it, but he forgot it a minute after hanging up the phone.

So, I went digging while he was still telling me how he got Tivoli to kick his ass … After a bit I thought, “hrrrrrm, shouldn’t this be covered in the Tivoli documentation?”, and surprisingly it actually is.

It’s rather simple (the whole sequence is sketched after the steps below).

  1. Stop the dsmserv Linux-HA cluster service (tsm-control ha stop tsm1)
  2. Set up the environment (since we’re running multiple instances of Tivoli Storage Manager: export DSMSERV_DIR and DSMSERV_CONFIG)
  3. Change into the server’s directory
  4. Run dsmserv restore db
  5. Wait a while (it took about half an hour to restore the 95 GB database and the 10 GB recovery log)
  6. Start the dsmserv Linux-HA cluster service (tsm-control ha start tsm1)
  7. Update the server-to-server communication, since restoring the database changes the communication verification token
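Strung together, the whole thing looks roughly like this; the instance paths are placeholders and the server-to-server part is only hinted at, not our exact commands:

    tsm-control ha stop tsm1                        # 1. stop the cluster service

    export DSMSERV_DIR=/opt/tivoli/tsm/server/bin   # 2. per-instance environment
    export DSMSERV_CONFIG=/tsm/tsm1/dsmserv.opt

    cd /tsm/tsm1                                    # 3. the server's directory
    dsmserv restore db                              # 4./5. restore and wait

    tsm-control ha start tsm1                       # 6. start the service again

    # 7. re-establish server-to-server communication on the partner servers,
    #    since the restore changes the communication verification token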

Nagios: Watching Clustered environments (the other way)

Well, recently I took on monitoring our clustered environments … Michael has a good howto on watching Windows cluster environments in the NSclient++ wiki.

Now, this has its own perks, which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE server. Combining Linux-HA with SLES10 is generally a good thing, but using startproc in that resource agent is not such a good idea.

Apparently Novell (or SuSE GmbH) thought it might be wise to include some additional logic in these wrappers: startproc, checkproc and killproc all check the name of the executable. So if you try to start an additional process with the same name, you need to dig a bit deeper.

For this to work, you need two additional things (quotations directly from man 8 startproc):

-p pid_file
(Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.

Now, apparently this alone isn’t enough; startproc still refuses to start a second process.

-i ignore_file
The pid found in this file is used as session id of the same binary program which should be ignored by startproc.
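Putting the two options together, starting a second NRPE instance next to the “normal” one might look like the sketch below; the paths, file names and nrpe options are assumptions for illustration, not my actual resource agent:

    # remember the pid of the already-running nrpe so startproc ignores it
    pidof nrpe > /var/run/nrpe-local.ignore

    # start the clustered instance with its own pid file and config
    startproc -i /var/run/nrpe-local.ignore -p /var/run/nrpe-cluster.pid \
        /usr/sbin/nrpe -c /etc/nagios/nrpe-cluster.cfg -d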

Read More

Linux-HA: Creating a random authkey

I just looked over the slides of a presentation one of my trainees brought back from Chemnitz, and there was this nifty one-line command with which you can generate a random SHA-1 key for your authkeys file.

Now, since I’m a bit lazy, here’s a complete command line to fill /etc/ha.d/authkeys for you:
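The exact one-liner from the slides may have differed; something along these lines does the job (hash 512 bytes of kernel randomness and write a matching authkeys file):

    ( echo "auth 1"
      echo "1 sha1 $(dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | awk '{ print $1 }')"
    ) > /etc/ha.d/authkeys
    chmod 600 /etc/ha.d/authkeys

The chmod matters, since heartbeat refuses to start when authkeys is readable by anyone but root.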

Linux-HA and Tivoli Storage Manager (Finito!)

As I previously said, I was writing my own OCF resource agent for IBM’s Tivoli Storage Manager Server. And I just finished it yesterday evening (it took me about two hours to write this post).

Trac revision log (shortened)

It only took me about four work days (roughly four hours each, which weren’t recorded in that subversion repository), plus most of this week at home (about 10 hours a day), and about one hundred subversion revisions. The good part is that it actually just works 😀 (I was amazed at how well, actually). Now you’re gonna say, “but Christian, why didn’t you just use the included init script and fix it up so it’s actually compliant with the LSB standard?”

The answer is rather simple: yeah, I could have done that, but you also know that wouldn’t have been fun. Life is all about learning, and learn something I did (even if I hit my head against the wall from time to time 😉 during those few days) … There are still one or two things I might want to add/change in the future (that is, maybe next week), like

  • adding support for monitor depth by querying the dsmserv instance via dsmadmc (if you read through the resource agent, I already use it for the shutdown/pre-shutdown stuff) – sketched after this list
  • I still have to properly test it (like Alan Robertson mentioned in his ninety-minute talk on Linux-HA 2.0 and on his slides, pages 100–102) in a pre-production environment
  • I’ll probably configure the IBM RSA to act as a STONITH device (shoot the other node in the head), just in case one of the nodes ever gets stuck in a state where the box is still up but doesn’t react to any requests anymore
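For the monitor-depth point, the check could be as simple as asking the instance whether it still answers administrative queries. A sketch, with made-up credentials and return-code handling that isn’t taken from the actual agent:

    # deeper monitor: is the instance still answering administrative queries?
    if dsmadmc -id="${ADMIN_USER}" -password="${ADMIN_PASS}" -dataonly=yes \
            "query status" > /dev/null 2>&1; then
        exit 0    # OCF_SUCCESS
    else
        exit 1    # OCF_ERR_GENERIC - the process may be up, but it isn't responding
    fi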

Read More

Setting up Linux-HA

Well, initially I thought writing the OCF resource agent for Tivoli Storage Manager was the hard part. But as it turns out, it really ain’t. The hard part is getting the resources into the heartbeat configuration (or whatever you wanna call it). The worst part is that hb_gui is completely worthless if you want to do a configuration without quorum.

First of all, we need to set up the main Linux-HA configuration file (/etc/ha.d/ha.cf). Configuring it is rather simple. For me, as I have two network devices over which both nodes see each other (one is a bond comprising two plain old 1G copper ports; the other is the 1G fibre cluster port), the configuration looks roughly like the sketch below.
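This is not the production file reproduced verbatim; the interface names, node names and timings in this /etc/ha.d/ha.cf sketch are placeholders:

    use_logd        yes
    # timings (seconds)
    keepalive       1
    warntime        10
    deadtime        30
    initdead        60
    # send heartbeats over both links
    bcast           bond0 eth2
    # the two cluster members
    node            tsmnode1 tsmnode2
    # enable the version-2 CRM, needed for OCF resources
    crm             yes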

Once the service itself is configured, one just needs to start the heartbeat daemon on both nodes. Afterwards, we should be able to configure the cluster resources.

I find it a lot easier to just update the corresponding sections with cibadmin (the man page really has some good examples). So here are my configuration files for two resource groups (crm_mon doesn’t differentiate between resources and grouped resources; it’ll just show you that you configured two resources).
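The XML files themselves aren’t shown here, but loading such a group definition into the CIB boils down to something like this (the file name is a placeholder):

    # create the resource group defined in group-tsm1.xml in the CIB
    cibadmin -o resources -C -x group-tsm1.xml

    # one-shot overview of what the cluster now manages
    crm_mon -1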
Read More

Linux-HA and Tivoli Storage Manager

Well, since we received part of our shipment on Wednesday, I finally looked at how we’re gonna deploy our active/active Tivoli Storage Manager configuration. Right now, we have a single pSeries box hosting ~100 client nodes, which we’re looking to split in two (since we now have two x366s for that purpose).
Now, as there ain’t no solution for this scenario yet (neither from International Business Machines nor from the open source community), I sat down and started writing an OCF resource agent for dsmserv (that is, the Tivoli Storage Manager server).
At first I had a bit of trouble adjusting to how stupid/non-standard dsmserv is, but after reading through the Storage Manager installation handbook (the part on multiple installations on a single server) and through some people’s notes on multiple deployments of Tivoli Storage Manager on the same server, I think I managed to get my head around it.
I still think the resource agent lacks some real testing (I put a two-node cluster online on Tuesday, but that one is non-production), but that’ll happen soon.

As you can see, I reworked the “stop” phase to first terminate all running processes and then dismount all tapes in order to avoid data corruption (that was advice from our friendly IBM systems engineer); if that fails, it tries terminating the server with a “friendly” kill (SIGTERM); and if that ain’t helping, it kills it the “Die Hard Way”™ (SIGKILL).
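In shell terms, the escalation boils down to something like the sketch below; the pid-file variable, the timeouts and the helper doing the dsmadmc work are assumptions, not a copy of the agent:

    wait_for_exit() {    # wait up to $2 seconds for pid $1 to go away
        i=0
        while kill -0 "$1" 2>/dev/null; do
            [ "$i" -ge "$2" ] && return 1
            sleep 1
            i=$((i + 1))
        done
        return 0
    }

    stop_dsmserv() {
        pid=$(cat "${PIDFILE}" 2>/dev/null) || return 0

        # 1. graceful: cancel processes and dismount tapes via dsmadmc
        graceful_shutdown
        wait_for_exit "$pid" 60 && return 0

        # 2. the "friendly" kill
        kill -TERM "$pid" 2>/dev/null
        wait_for_exit "$pid" 30 && return 0

        # 3. the "Die Hard Way"
        kill -KILL "$pid" 2>/dev/null
        wait_for_exit "$pid" 10
    }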