Weird TS3500 problem

Well, today we had a rather weird problem with our TS3500. TSM running on AIX basically went bonkers and spat out weird media sense errors, all stating that there was a hardware or media error of unknown nature.

After restarting the TSM server (as in the service, not the whole box) five times, which didn’t resolve squat, we decided to take a look at the TS3500 itself. We opened up the management interface and tried moving a tape into a drive. That didn’t work. Hrmmmmm.

We tried the manual move from the LCD display mounted on the front of the TS3500 base frame, that didn’t work either. So we figured the gripper was stuck and placed a call with our trustworthy support provider.

After a few minutes, they called us back and told us: “Try the following: Place the library in “Pause”-Mode and open it up, maybe a tape fell down …“.

We did exactly that; the gripper moved back to its pause position (which is in the base frame), and after opening up the base frame and an expansion frame we started looking inside. Nothing …

So we closed it back up and let the base frame resume its normal duties … guess what: after resuming normal operations, it worked again *shrug*

Novell KMP: vmware-tools-kmp and ibm-lin_tape-kmp

Disclaimer: I don’t take any responsibility for faults within the software, I just provide the RPMs! Feel free to ask me about stuff concerning these RPMs, but I ain’t accountable if your stuff goes kaboom … Oh, and these RPMs aren’t endorsed or supported by Novell or IBM!

After working with the novell-kmp solution, I think it’s actually rather easy to create a “Kernel Module Package“. In the end, I created two additional KMPs: one for the tools component of the VMware Tools shipped with VMware ESX, and another for the lin_tape SCSI driver used by our IBM TS3400 as well as the IBM TS7530.

Some parts (especially the build system used within the VMware kernel modules) took some figuring out/playing around, but I actually got it working. Now each time I update the VMware-Tools I just need to install the new RPM, tada! No need for a fully fledged build environment on every box.
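To give an idea of how little is actually needed: here’s a hedged skeleton of such a spec file, loosely following Andreas Grünbacher’s KMP guide. The package name, version and source tarball are illustrative; %suse_kernel_module_package and %flavors_to_build are the real SUSE-provided macros.

```spec
# Sketch of a KMP spec file for SLES -- names/version are illustrative.
Name:           ibm-lin_tape-kmp
Summary:        IBM lin_tape driver as a Kernel Module Package
Version:        1.24.0
Release:        0.1
License:        GPL v2 or later
Source0:        lin_tape-%{version}.tar.gz
BuildRequires:  kernel-source kernel-syms
# Generates one subpackage per kernel flavor (default, smp, xen, ...)
%suse_kernel_module_package

%description
IBM lin_tape SCSI tape driver, packaged as a KMP.

%prep
%setup -n lin_tape-%{version}
set -- *
mkdir source obj
mv "$@" source/

%build
for flavor in %flavors_to_build; do
    rm -rf obj/$flavor
    cp -r source obj/$flavor
    make -C /usr/src/linux-obj/%_target_cpu/$flavor modules \
         M=$PWD/obj/$flavor
done

%install
export INSTALL_MOD_PATH=$RPM_BUILD_ROOT
export INSTALL_MOD_DIR=updates
for flavor in %flavors_to_build; do
    make -C /usr/src/linux-obj/%_target_cpu/$flavor modules_install \
         M=$PWD/obj/$flavor
done
```

The per-flavor loop is the whole trick: the macro figures out which flavors your kernel-syms provides, and rpmbuild spits out one kmp subpackage per flavor.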

SUSE Linux Enterprise Server 10:

  • ibm-lin_tape-1.24.0_2.6.16.60_0.37_f594963d-0.1 (i586, x86_64, SRPM)
    • ibm-lin_tape-kmp-bigsmp (i586)
    • ibm-lin_tape-kmp-debug (i586, x86_64)
    • ibm-lin_tape-kmp-default (i586, x86_64)
    • ibm-lin_tape-kmp-kdump (i586, x86_64)
    • ibm-lin_tape-kmp-kdumppae (i586)
    • ibm-lin_tape-kmp-smp (i586, x86_64)
    • ibm-lin_tape-kmp-vmi (i586)
    • ibm-lin_tape-kmp-vmipae (i586)
  • vmware-tools-kmp-3.5.0_153875_2.6.16.60_0.37_f594963d-0.1 (SRPM)
    • vmware-tools-kmp-bigsmp (i586)
    • vmware-tools-kmp-debug (i586, x86_64)
    • vmware-tools-kmp-default (i586, x86_64)
    • vmware-tools-kmp-kdump (i586, x86_64)
    • vmware-tools-kmp-kdumppae (i586)
    • vmware-tools-kmp-smp (i586, x86_64)
    • vmware-tools-kmp-vmi (i586)
    • vmware-tools-kmp-vmipae (i586)
    • vmware-tools-kmp-xen (i586, x86_64)
    • vmware-tools-kmp-xenpae (i586)

SUSE Linux Enterprise Server 11:

  • ibm-lin_tape-1.24.0_2.6.27.21_0.1-0.1 (i586, x86_64, SRPM)
    • ibm-lin_tape-kmp-debug (i586, x86_64)
    • ibm-lin_tape-kmp-default (i586, x86_64)
    • ibm-lin_tape-kmp-pae (i586)
    • ibm-lin_tape-kmp-trace (i586)
    • ibm-lin_tape-kmp-vm (i586, x86_64)
  • vmware-tools-kmp-3.5.0_153875_2.6.27.21_0.1-0.1 (SRPM)
    • vmware-tools-kmp-debug (i586, x86_64)
    • vmware-tools-kmp-default (i586, x86_64)
    • vmware-tools-kmp-pae (i586)
    • vmware-tools-kmp-trace (i586, x86_64)
    • vmware-tools-kmp-vmi (i586)
    • vmware-tools-kmp-xen (i586, x86_64)

Novell KMP: Usable version of ibm-rdac-ds4000

After some more tinkering, and a lot more looking at the macros in /usr/lib/rpm/rpm-suse-kernel-module-subpackage and /usr/lib/rpm/suse_macros, I think I finally have a usable RPM’ified version of IBM’s multipathing driver ready for use.

There is still one major annoyance left: each time you install a new ibm-rdac-ds4000-kmp RPM, you also need to reinstall the corresponding ibm-rdac-ds4000-initrd package, as the macros in /usr/lib/rpm don’t allow for a custom %post or %postun.
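Until that’s solved, the workaround is to always update both packages in lockstep. A small sketch: the helper just rewrites the kmp package name into the matching initrd package name (naming per the package list below); the actual rpm invocation is left as a comment.

```shell
# Hypothetical helper: derive the matching initrd package name from a
# kmp package name, so both can be (re)installed in one go.
matching_initrd() {
    # ibm-rdac-kmp-default-... -> ibm-rdac-initrd-default-...
    printf '%s\n' "$1" | sed 's/-kmp-/-initrd-/'
}

kmp="ibm-rdac-kmp-default-09.03.0C05.0030_2.6.16.60_0.37_f594963d-0.2"
initrd=$(matching_initrd "$kmp")
echo "$initrd"
# rpm -Uvh "$kmp.x86_64.rpm" && rpm -Uvh --force "$initrd.x86_64.rpm"
```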

As mentioned before, I’m gonna send them to LSI/IBM for review, and maybe, MAYBE they are actually gonna make use of that.

Without further delay, here’s the list of packages. Just a short explanation: you need mppUtil-%version in order to install the ibm-rdac-ds4000-kmp.

  • mppUtil-09.03.0C05.0030-0.2 (i586, x86_64, SRPM)
  • ibm-rdac-kmp-09.03.0C05.0030_2.6.16.60_0.37_f594963d-0.2 (SRPM)
    • ibm-rdac-kmp-bigsmp (i586)
    • ibm-rdac-kmp-debug (i586, x86_64)
    • ibm-rdac-kmp-default (i586, x86_64)
    • ibm-rdac-kmp-kdump (i586, x86_64)
    • ibm-rdac-kmp-kdumppae (i586)
    • ibm-rdac-kmp-smp (i586, x86_64)
    • ibm-rdac-kmp-vmi (i586)
    • ibm-rdac-kmp-vmipae (i586)
  • ibm-rdac-ds4000-initrd-09.03.0C05.0030_2.6.16.60_0.37_f594963d-0.2
    • ibm-rdac-initrd-bigsmp (i586)
    • ibm-rdac-initrd-debug (i586, x86_64)
    • ibm-rdac-initrd-default (i586, x86_64)
    • ibm-rdac-initrd-kdump (i586, x86_64)
    • ibm-rdac-initrd-kdumppae (i586)
    • ibm-rdac-initrd-smp (i586, x86_64)
    • ibm-rdac-initrd-vmi (i586)
    • ibm-rdac-initrd-vmipae (i586)

This package should be usable with System Storage DS4000 as well as System Storage DS3000 (they use the exact same source code).

I also know that this solution isn’t really perfect. I’ve been looking at the %triggerin/%triggerun macros, but right now I can’t draw up an (easy) scenario to successfully use triggers in this situation. The only idea I came up with looks like this:

  1. Put the triggers into ibm-rdac-ds4000
  2. When installing the kernel module packages, write the kernel version/flavor into a temporary file (impossible, since the macros don’t let you influence %post), and then let the trigger create/update the MPP initrd

If anyone knows a better solution (as in easier, without writing to a separate file), I’m all ears.
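For completeness, step 1 would look roughly like this in the ibm-rdac-ds4000 spec. Sketch only, since step 2 is exactly what the macros prevent; the flavor list and the mppUpdate path/invocation are illustrative.

```spec
# Hypothetical trigger in the ibm-rdac-ds4000 package: fires whenever
# one of the KMP subpackages gets installed or updated.
%triggerin -- ibm-rdac-kmp-default, ibm-rdac-kmp-smp, ibm-rdac-kmp-bigsmp
# This would need the kernel version/flavor the KMP was built for,
# e.g. from a file written by the KMP's %post -- the very step the
# stock macros don't allow.
if [ -f /var/run/ibm-rdac-kmp.version ]; then
    read kver < /var/run/ibm-rdac-kmp.version
    /opt/mpp/mppUpdate "$kver"   # illustrative path and invocation
fi
```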

Novell KMP: KMP’ing IBM’s RDAC driver

Well, after yesterday’s lesson about getting the IBM RDAC driver to install for a not-yet-running kernel, I decided to take it a step further. Novell does have some documentation about KMPs, which is actually rather good, especially the guide written by Andreas Grünbacher.

After a bit of tinkering, I actually got it working. I was kinda surprised at how easy it actually is. One problem I still have to deal with is modifying the %post to generate the mpp-initrd image. For now, the KMP only contains the default %post, which updates the modules.* stuff.

Now I’m kinda asking myself why more vendors don’t submit their drivers to Novell in the form of KMPs … Anyway, I’m gonna send mine the LSI/IBM way, maybe they’ll pick it up …

IBM RDAC: Installing the driver for a (not yet) running version

Well, kernel updates on our Linux servers running IBM’s RDAC driver (developed by LSI) are a real pest … especially since you have to reboot the box twice in order to install the drivers/initrd correctly.

So I sat down and looked at the Makefile. Turns out it just needs four tweaks in order to work with a different kernel version (which you pass to make via environment variables).

After that, a simple make KERNEL_OBJ=/lib/modules/2.6.16.60-0.37_f594963d-smp/build OS_VER=2.6.16.60-0.37_f594963d-smp install correctly installs the modules into /lib/modules, rebuilds the module dependencies, and builds the correct initrd image.
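For the next kernel update, that can be wrapped into a small loop over every installed kernel. A sketch: build_cmd only composes and prints the make invocation, so nothing is built until you drop the echo.

```shell
# Compose the RDAC build/install command for a given kernel version.
build_cmd() {
    kver="$1"
    echo "make KERNEL_OBJ=/lib/modules/$kver/build OS_VER=$kver install"
}

# Print the command for every installed kernel (drop the echo inside
# build_cmd to actually run them).
for kdir in /lib/modules/*/; do
    kver=${kdir#/lib/modules/}
    kver=${kver%/}
    build_cmd "$kver"
done
```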

TSM: Restoring the database/recovery log to a point-in-time

Well, my co-worker just called on my cell (it’s Friday, 16:00), and asked me which start-up script he needed to change in order to restore the database. My first response was, “ummm, that’s gonna be hard, we’re using heartbeat”.

Okay, so after a bit of asking I got out of him what he wanted to achieve by changing the start-up script. Apparently he did something to crash Tivoli Storage Manager (or rather, repeatedly crash it) and wanted to restore the database. He had talked to one of the systems partners we have (and I’m happy we have them, most of the time), who in turn told him how to do it, but he forgot it a minute after he hung up the phone.

So I went digging while he was still telling me how he got Tivoli to kick his ass … After a bit, I thought “hrrrrrm, shouldn’t this be covered in the Tivoli documentation?”, and surprisingly it actually is.

It’s actually rather simple.

  1. Stop the dsmserv Linux-HA cluster service (tsm-control ha stop tsm1)
  2. Set up the environment (since we’re running multiple instances of Tivoli Storage Manager: export DSMSERV_DIR and DSMSERV_CONFIG)
  3. Change into the server’s directory
  4. Run dsmserv restore db
  5. Wait some time (it took about half an hour to restore the 95G database and the 10G recovery log)
  6. Start the dsmserv Linux-HA cluster service (tsm-control ha start tsm1)
  7. Update the server-to-server communication, since the database restore changes the communication verification token
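The steps above as a dry-run shell sketch: the function only prints each command (swap the echos for the real thing to execute it). tsm-control, dsmserv restore db and the DSMSERV_* variables are from our setup; the server directory is hypothetical.

```shell
# Dry-run sketch of the point-in-time restore procedure for one TSM
# instance. Step 7 (updating server-to-server communication) has to be
# done manually afterwards.
restore_tsm_db() {
    inst="$1"   # Linux-HA resource / instance name, e.g. tsm1
    dir="$2"    # server directory of that instance (hypothetical path)
    echo "tsm-control ha stop $inst"
    echo "export DSMSERV_DIR=$dir"
    echo "export DSMSERV_CONFIG=$dir/dsmserv.opt"
    echo "cd $dir && dsmserv restore db"
    echo "tsm-control ha start $inst"
}

restore_tsm_db tsm1 /opt/tivoli/tsm/server/tsm1
```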

Nagios: Service Check Timed Out

Since I got the pleasure of watching some Windows boxen with Nagios, I took the Windows Update plugin from Michal Jankowski and implemented it. It took me some time to initially set up nsclient++ correctly so it just works, but up till now the check plugin sometimes reported the usual “Service Check Timed Out”.

Usually I ended up increasing the cscript timeout or the nsclient++ socket timeout, but the error still kept showing up. Since I rely heavily on my monitoring tools, I demand that as few false positives as possible show up. So I ended up chasing down this error today, and in the end I have to say it was quite simple.

In my case, it wasn’t cscript (that timeout is set to 300 seconds), nor nsclient++ (the socket timeout is set to 300 seconds too), nor the nrpe plugin itself (that has 300 seconds as well).

As it turns out, Nagios has an additional setting controlling these things, called service_check_timeout, which defaults to 60 seconds. Sadly the plugin, or rather Windows, needs longer than those 60 seconds to figure out whether or not it needs updating, so Nagios kills the plugin and returns a CRITICAL message.

After increasing the value of service_check_timeout, that should hopefully be fixed.
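In config terms the fix is a single line in the main configuration file; a sketch (the path is the usual default, adjust to your installation):

```ini
# /etc/nagios/nagios.cfg
# Raise the global service check timeout to match the 300s timeouts
# already set for cscript, nsclient++ and nrpe, so Nagios doesn't
# kill slow checks like the Windows Update plugin after 60s.
service_check_timeout=300
```

Nagios needs a restart (or reload) to pick the new value up.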

SLES10: zypper.log

Well, I just stumbled upon something .. My Nagios at work wasn’t working anymore, and I went looking.

After that, zip – nada. Next thing: check whether or not the device is really full … okay, df …

So it actually is completely filled up. Now we need to find out who’s hogging the space. Since I had a hunch (pnp4nagios), I went straight for /var/lib …

That wasn’t it … so on to the next place that’s suspicious most of the time: /var/log.

I was like “WTF? 5.2G of YaST2 logs?” when I initially saw that output … As of now, I’ve got a crontab entry emptying /var/log/YaST2 every 24 hours …
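The hunt itself boils down to one du pipeline; a sketch, wrapped in a tiny helper that prints the biggest space hog under a directory (in my case this pointed straight at /var/log/YaST2):

```shell
# Print the subdirectory using the most disk space under $1.
largest_dir() {
    du -sk "$1"/*/ 2>/dev/null | sort -rn | head -n 1 | cut -f2
}

largest_dir /var/log

# And the stopgap, as a cron entry (schedule is what I picked, adjust
# to taste):
#   0 3 * * * root find /var/log/YaST2 -type f -delete
```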

Nagios: SNMP OIDs for IBM’s RSA II adapter

Well, after some poking around I finally found some OIDs for the RSAs (only through these two links: check_rsa_fan and check_rsa_temp).

For Nagios, I dismissed the fans, since the fan speed is only reported as percent values. So I only added this:

Oh, and if anyone else is as curious as me, here’s the list of OIDs, courtesy of Gerhard Gschlad and Leonardo Calamai.

For the fans:

And for the temperatures:

I just found a proper list of OIDs for the IBM RSA adapter. That’s rather nice, since I was really looking for the VRM failure OID and the OIDs for other warning/critical events.