New IBM RDAC version (or not)

A week ago (September 2nd), I received a mail announcing the release of IBM’s new multipathing device driver for the DS4x00 series, which finally works with SLES11 (the software available up till now doesn’t; it fails with kernels > 2.6.26, iirc).

ESC+ notification detailing the release

There wouldn’t be any trouble if IBM (or rather LSI, the vendor providing the driver) had actually released the driver … but as of today, I have yet to see the new version appear on the download page. I already tried to notify IBM about the problem, but as usual there is no obvious way to get this to the right person.

Well, IBM just replied to my feedback and apparently the download is available (it is right now, after two weeks hah — finally).

Tivoli Storage Manager Server 5.5.3

I spent yesterday afternoon upgrading our TS7530, and while I was at it I also upgraded TSM to 5.5.3. Once I started TSM, it quickly began complaining about the paths to the drives.

I thought this might be a mere device problem (we have had them before), so I rebooted the boxes. Still no luck, and after about an hour of trying I went home. In the morning, my co-worker called our trustworthy IBM service partner, and the TSM consultant said he had hit the exact same problem the day before. We had two options:

  1. Enable the option SANDISCOVERY, with the (completely undocumented) Passive setting (setopt SANDISCOVERY PASSIVE)
  2. Downgrade back to 5.5.2

For now, we implemented the first option, in the hope that’ll solve our troubles. And it actually does.

Mass-updating Tivoli Storage Manager drive status

I was fighting with our VTL again, and TSM thought all the drives were offline. In order to update the drive status, you’d need to go into the ISC, select each drive and set it to ONLINE. Since I’m a bit click-lazy, I wrote a simple nested for-loop that generates the commands to update all the drives at once:

Result is a list like this:

The same goes for mass-updating the path status:

Result is a list like this:
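A minimal sketch of such a nested loop, written in Python here. The library names, the drive naming scheme, the drive count and the server name are all hypothetical placeholders, and the UPDATE DRIVE / UPDATE PATH parameters are written as I remember them from the TSM administrative command reference, so check them against your server version before pasting the output into an admin session.

```python
# Hypothetical inventory: replace with the libraries, drives and server of your own setup.
LIBRARIES = ["VTL1", "VTL2"]
DRIVES_PER_LIBRARY = 4
SERVER = "TSM1"


def drive_names(library, count):
    """Yield assumed drive names such as VTL1_DRV01 (naming scheme is made up)."""
    for index in range(1, count + 1):
        yield f"{library}_DRV{index:02d}"


def update_drive_commands(libraries, count):
    """Build one UPDATE DRIVE command per drive in every library."""
    return [
        f"UPDATE DRIVE {library} {drive} ONLINE=YES"
        for library in libraries
        for drive in drive_names(library, count)
    ]


def update_path_commands(server, libraries, count):
    """Build one UPDATE PATH command per server-to-drive path."""
    return [
        f"UPDATE PATH {server} {drive} SRCTYPE=SERVER DESTTYPE=DRIVE "
        f"LIBRARY={library} ONLINE=YES"
        for library in libraries
        for drive in drive_names(library, count)
    ]


if __name__ == "__main__":
    commands = update_drive_commands(LIBRARIES, DRIVES_PER_LIBRARY)
    commands += update_path_commands(SERVER, LIBRARIES, DRIVES_PER_LIBRARY)
    for command in commands:
        print(command)
```

The printed lines can be reviewed first and then fed to the administrative client, which is the whole point: no clicking through the ISC for every single drive.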

IBM RSA II adapter and Java RE (fini)

If you remember back to July, I looked into some trouble I had with the IBM RSA II adapter’s Java interface and the latest JRE updates. I just noticed that IBM released a new firmware for the RSA yesterday. The ChangeLog states this:

Version 1.13, GFEP35A
Problem(s) Fixed:

* Suggested
o Fix for Remote Control General Exception in JRE 1.6 update 12 and above.
o Corrected a problem that DHCP renew/release may fail after a long time.
o Corrected a problem that remote control preference link disapears after creating new key buttons.
o Corrected a problem that cause event number shows only from 0 to 255 when views RSA log via telnet session.

As you can see, IBM finally decided that it isn’t a Sun problem but rather their own! Finally, after about 4 months a fix, yay!

The fix is only for the x3550 for now, but it’s a light at the end of the tunnel and gives me hope that they are going to fix it for the other RSA adapters too!

rpc.statd starting before portmap

One problem gone, another one turns up. When rpc.statd (nfs-common) tries to start before portmap, it fails, and the logfile (/var/log/daemon.log) shows a rather cryptic error message:

After fixing the start order (I really hate *SUSE*/Debian* for not having init-script dependencies, like Gentoo’s baselayout/Roy’s openrc does have), everything is like it should be and I’m able to put the /srv/xen mount into the fstab.
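To spot this kind of ordering problem before a reboot, a small check along these lines can help. It is only a sketch: it assumes classic sysv-rc behaviour, where the S links in a runlevel directory are started in lexical order of their names, and the link names in the usage note are made-up examples, not the ones from my box.

```python
import re


def start_order(entries):
    """Return service names in the order sysv-rc would start them.

    `entries` are link names from a runlevel directory, e.g. 'S18portmap'.
    Only start links (S + two digits + name) are considered.
    """
    links = sorted(e for e in entries if re.fullmatch(r"S\d\d.+", e))
    return [link[3:] for link in links]


def starts_before(entries, first, second):
    """True if `first` is started before `second`.

    Raises ValueError if either service has no start link.
    """
    order = start_order(entries)
    return order.index(first) < order.index(second)
```

With a real system you would pass in something like os.listdir("/etc/rc2.d"); starts_before(["S20nfs-common", "S18portmap"], "portmap", "nfs-common") is what you want to hold.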

portmap hanging on shutdown

Here’s yet another post about my compute cluster. It’s (obviously) running NFS, and that works quite well. Up till now, though, I always had trouble with portmap hanging on shutdown/reboot. After spending some time thinking about the problem, looking at the init script and googling, I stumbled upon this Ubuntu bug on portmap.

As noted in the bug, a pmap_dump would hang indefinitely. After taking another look at our nfs-root configuration (in regard to the first comment on the bug), it turns out it’s exactly that. We didn’t set up lo (the loopback interface), which seems vital for some things.

After adding the lines

to /etc/network/interfaces, portmap stops just fine …
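For reference, the usual Debian loopback stanza is "auto lo" followed by "iface lo inet loopback"; I am assuming that is what an nfs-root image needs here. The sketch below checks a copy of /etc/network/interfaces for exactly that stanza. The function name is made up for illustration.

```python
def has_loopback(interfaces_text):
    """True if the text both auto-starts lo and defines it as a loopback interface."""
    auto = False
    iface = False
    for raw in interfaces_text.splitlines():
        stripped = raw.strip()
        # interfaces(5) only treats lines starting with '#' as comments.
        if not stripped or stripped.startswith("#"):
            continue
        words = stripped.split()
        if words[0] == "auto" and "lo" in words[1:]:
            auto = True
        if words[:4] == ["iface", "lo", "inet", "loopback"]:
            iface = True
    return auto and iface
```

Running this against the interfaces file inside the NFS root (rather than the one on the host) is the interesting case, since that is where lo was missing.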

OFED packages for Debian

As I mentioned yesterday, I’m currently doing some project work. Said project includes InfiniBand technology.

Apparently we bought a “cheap” InfiniBand switch, which comes without a subnet manager. So, in order to communicate between the nodes, you need to install the subnet manager (opensm in my case) on each node.

In order to utilize the InfiniBand interface you need to do a few things first though:

  1. Obviously install the opensm package
  2. Add ib_umad and ib_ipoib to /etc/modules
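Step 2 is easy to forget on one of the nodes, so here is a minimal sketch that reports which of the two modules are still missing from a copy of /etc/modules. The helper name is hypothetical; the file format assumed is the usual one (one module per line, optional arguments, '#' comments).

```python
REQUIRED_MODULES = ("ib_umad", "ib_ipoib")


def missing_modules(modules_text, required=REQUIRED_MODULES):
    """Return the required modules that are not listed in the given /etc/modules text."""
    listed = set()
    for raw in modules_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The first word is the module name; anything after it is module arguments.
        listed.add(line.split()[0])
    return [module for module in required if module not in listed]
```

An empty list means both modules will be loaded at boot; anything else names what still has to be added.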

After installing opensm on the host as well as the NFS root, opensm comes up just fine and the network starts automatically. The only trouble right now is that ISC’s DHCP doesn’t support InfiniBand; otherwise I could even use DHCP to distribute the IP addresses.

Xen dom0 failing with kernel panic

I’m building a 6-node cluster using Xen at the moment. For the last few days, I tried my setup in a virtual machine, simply because VMs boot much faster than the real hardware. However, certain things you can only replicate on the real hardware (for example, the InfiniBand interfaces, as well as certain NFS stuff).

So I spent most of the day replicating my configuration onto the hardware. After getting it all done, the moment of the first boot … kaput! It doesn’t boot, it just keeps hanging before booting the real kernel. Now what? I removed the Xen vga parameters and rebooted (waiting ~2 minutes in the process) until I finally saw the root cause of my trouble:

I was like *wtf* … My tftp setup _worked_ inside the VMs, why ain’t it working here? A quick look at the pxelinux.cfg for the MAC address revealed this:

As you can see, I had assigned 64M to the dom0, which apparently wasn’t enough. After raising the memory limit to 256M, everything is hunky-dory!
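A sketch of how this could be caught before booting: pull the dom0_mem value out of the append line and compare it against a minimum. It only handles the plain dom0_mem=<size> form, treats a bare number as kilobytes (which is how I remember the Xen 3 documentation describing it), and the append line in the usage note is a made-up example, not my actual config.

```python
import re

UNIT_FACTORS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def dom0_mem_bytes(append_line):
    """Return the dom0_mem setting in bytes, or None if it is absent or not in plain form."""
    match = re.search(r"\bdom0_mem=(\d+)([BKMG]?)", append_line, re.IGNORECASE)
    if match is None:
        return None
    number, unit = match.groups()
    # Assumption: a value without suffix is interpreted as kilobytes.
    return int(number) * UNIT_FACTORS[unit.upper() or "K"]


def dom0_mem_is_at_least(append_line, minimum_bytes):
    """True only if dom0_mem is set and not below the given minimum."""
    value = dom0_mem_bytes(append_line)
    return value is not None and value >= minimum_bytes
```

For a hypothetical line such as "append xen.gz dom0_mem=64M --- vmlinuz root=/dev/nfs --- initrd.img", dom0_mem_is_at_least(line, 256 * 1024 ** 2) comes back False, which is the situation I was in.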

TS7530 authentication failure

Today, I had a rather troublesome morning. Once I got to work, Nagios was already complaining about lin_taped on one of our TSM servers, which had apparently failed due to too many SCSI resets. Additionally, I can’t log in using the VE console (I can, however, log in using SSH), so I ended up opening an IBM Electronic Service Call (ESC+).

Using SSH, I can get some information on the VE’s status:

After looking a bit deeper, it seems that neither of the two TSM servers is able to see the IBMchanger devices for the first VTL. The second is perfectly visible, just not the first. After putting both VE nodes into suspended failover and gathering support data for IBM support from both VEs and the Brocade SAN switches, apparently everything works again. I guess the library does have “self healing” properties.

VMware vSphere and templates

I just converted one of my (old) templates, as I wanted to refresh the updates and the virus scanner. After converting, I was asked about the UUID (no clue why), and expected to be done with it. But after looking at the console, I got the following, completely cryptic message:

Unable to connect to MKS

After digging a bit deeper (that is looking at the vmware.log of the virtual machine, since the message of the GUI is *real* cryptic), I’m a bit wiser:

After gracefully shutting the VM down and then powering it back up, everything is back in working order.