TSM: Restoring the database/recovery log to a point-in-time

Well, my co-worker just called on my cell (it’s Friday, 16:00), and asked me which start-up script he needed to change in order to restore the database. My first response was, “ummm, that’s gonna be hard, we’re using heartbeat”.

Okay, so after a bit of asking I got out of him what he wanted to achieve by changing the start-up script. Apparently he did something to crash Tivoli Storage Manager (or rather repeatedly crash it) and wanted to restore the database. He talked to one of the systems partners we have (and I’m happy we have them, most of the time), who in turn told him how to do it, but he forgot it a minute after he hung up the phone.

So, I went digging while he was still telling me how he got Tivoli to kick his ass … After a bit, I thought “hrrrrrm, shouldn’t this be covered in the Tivoli documentation?”, and surprisingly it actually is.

It’s actually rather simple.

  1. Stop the dsmserv Linux-HA cluster service (tsm-control ha stop tsm1)
  2. Set up the environment (since we’re running multiple instances of Tivoli Storage Manager – export DSMSERV_DIR and DSMSERV_CONFIG)
  3. Change into the server’s directory
  4. Run dsmserv restore db
  5. Wait some time (took about half an hour to restore the 95G database and the 10G recovery log)
  6. Start the dsmserv Linux-HA cluster service (tsm-control ha start tsm1)
  7. Update the server-to-server communication, since restoring the database changes the communication verification token
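The steps above boil down to something like this (a sketch only: the `tsm1` instance name and `tsm-control` come from our setup, the paths are hypothetical examples, adjust for your own instance):

```
# 1. Stop the HA-managed TSM instance
tsm-control ha stop tsm1

# 2. Set up the environment for this instance (paths are examples)
export DSMSERV_DIR=/opt/tivoli/tsm/server/bin
export DSMSERV_CONFIG=/tsm/tsm1/dsmserv.opt

# 3. Change into the server's directory (example path)
cd /tsm/tsm1

# 4./5. Restore the database and wait
#      (dsmserv restore db also accepts TODATE/TOTIME for
#       a point-in-time restore)
dsmserv restore db

# 6. Start the HA-managed TSM instance again
tsm-control ha start tsm1

# 7. Don't forget to update the server-to-server communication --
#    the verification token changed with the restore
```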

Nagios: Service Check Timed Out

Since I got the pleasure of watching some Windows boxen with Nagios, I took the Windows Update plugin from Michal Jankowski and implemented it. It took me some time to initially set up nsclient++ correctly so it just works, but up to now the check plugin sometimes reported the usual “Service Check Timed Out”.

Usually I ended up increasing the cscript timeout or the nsclient++ socket timeout, but the error still kept showing up. Since I rely heavily on my monitoring tools, I demand that as few false positives as possible show up. So I ended up chasing down this error today, and in the end it was quite simple.

In my case, it wasn’t cscript (that timeout is set to 300 seconds), nor nsclient++ (its socket timeout is set to 300 seconds too), nor the nrpe plugin itself (that has 300 seconds as well).

As it turns out, Nagios has an additional setting controlling these things, called service_check_timeout, which defaults to 60 seconds. Sadly the plugin, or rather Windows, needs longer than those 60 seconds to figure out whether or not it needs updating, so Nagios kills the plugin and returns a CRITICAL message.

Hopefully that’ll be fixed after increasing the value of service_check_timeout.
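For reference, the knob lives in the main Nagios configuration file; raising it to match the other three timeouts mentioned above looks like this (path is the usual default, adjust to your install):

```
# /etc/nagios/nagios.cfg
# Raise the global check timeout from the 60-second default to match
# the cscript, nsclient++ and nrpe timeouts (all 300 seconds), then
# restart Nagios for the change to take effect.
service_check_timeout=300
```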

SLES10: zypper.log

Well, I just stumbled upon something .. My Nagios at work wasn’t working anymore, and I went looking.

After that, zip – nada. Next thing, check whether or not the device is really full … Okay, df ..

So, it is actually completely filled up. Now we need to find out who’s hogging the space. Since I had a suspicion (pnp4nagios), I went straight for /var/lib …

That wasn’t it .. so on to the next place that’s suspicious most of the time, /var/log.
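The whole chase boils down to two commands (a sketch, using the paths from above):

```shell
# Confirm the filesystem really is full
df -h /var

# Then find the biggest offender below the suspicious directory,
# largest entries last
du -sh /var/log/* 2>/dev/null | sort -h | tail -n 5
```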

I was like “WTF? 5.2G of YaST2 logs?” when I initially saw that output … As of now, I have a crontab emptying /var/log/YaST2 every 24 hours …
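Such a cleanup job could look roughly like this (a sketch: the file name and the 03:00 schedule are my examples, not from the post):

```
# /etc/cron.d/yast2-logs
# Empty /var/log/YaST2 once a day; -mindepth 1 keeps the
# directory itself while deleting everything inside it
0 3 * * * root find /var/log/YaST2 -mindepth 1 -delete
```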

Nagios: SNMP OIDs for IBM’s RSA II adapter

Well, after some poking around I finally found some OIDs for the RSAs (only through these two links: check_rsa_fan and check_rsa_temp).

For Nagios, I dismissed the fans, since the fan speed is only reported as a percentage. So I only added this:

Oh, and if anyone else is curious like me, here’s the list of OIDs, courtesy of Gerhard Gschlad and Leonardo Calamai.

For the fans:

And for the temperatures:

I just found a proper list of OIDs for the IBM RSA adapter. That’s rather nice, since I was really looking for the VRM-failure OID and other warning/critical events.
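A Nagios check built on such an OID could be defined roughly like this (a sketch: the OID shown is a placeholder, not a real RSA II OID, and the community string and thresholds are made-up examples):

```
# commands.cfg -- hypothetical check against an RSA II temperature OID
define command{
    command_name    check_rsa_temp
    command_line    $USER1$/check_snmp -H $HOSTADDRESS$ -C public -o .1.3.6.1.4.1.X.Y.Z -w 60 -c 70
}
```

check_snmp then goes WARNING/CRITICAL as soon as the value read from the OID crosses the -w/-c thresholds, which is exactly what you want for temperatures (and why percent-only fan speeds are less useful here).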