Nagios and check_ram yet again

As some people know, I previously “created” check_ram in C (mostly by modifying the check_swap plug-in to print RAM usage). One of my problems for the past few months was reconciling that C plug-in with a “supported” environment. Today I had another look at the plug-ins available on NagiosExchange. There are quite a few, but as I do have some experience with Python, I used the one written in Python.

Hacking performance data support into it was rather easy, as the full post shows. Someone else already posted a non-unified diff for performance data support, but that ain’t quite right according to the Nagios plug-in development guidelines.
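For reference, the plug-in guidelines expect performance data after a `|` in the form `label=value[UOM];warn;crit;min;max`. The following is only a minimal sketch of that output format, not the actual plug-in or patch; the function names, the `ram_used` label, and the default thresholds are assumptions made up for illustration.

```python
def format_perfdata(label, value, uom="", warn=None, crit=None,
                    minimum=None, maximum=None):
    """Build one 'label=value[UOM];warn;crit;min;max' performance data item."""
    thresholds = ["" if x is None else str(x)
                  for x in (warn, crit, minimum, maximum)]
    # Unfilled trailing fields (and their semicolons) may be dropped.
    return ";".join([f"{label}={value}{uom}"] + thresholds).rstrip(";")


def check_ram(used_mb, total_mb, warn_pct=80, crit_pct=90):
    """Return (exit_code, output_line) for a hypothetical RAM check."""
    pct = 100.0 * used_mb / total_mb
    if pct >= crit_pct:
        code, state = 2, "CRITICAL"
    elif pct >= warn_pct:
        code, state = 1, "WARNING"
    else:
        code, state = 0, "OK"
    # Thresholds in the perfdata use the same unit as the value (MB here).
    perf = format_perfdata("ram_used", used_mb, "MB",
                           total_mb * warn_pct // 100,
                           total_mb * crit_pct // 100,
                           0, total_mb)
    return code, f"RAM {state} - {pct:.1f}% used ({used_mb} of {total_mb} MB) | {perf}"
```

A real plug-in would print the line and exit with the returned code (0, 1 or 2), so that Nagios picks up both the state and the graphable data.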

IBM TS7530 engine failover and HBA mode

Well, when they delivered the VTL about four weeks ago, nobody figured this thing would be such a mess. Apparently IBM hasn’t set up that many VTLs with engine failover.

Point being, the VEs have eight HBA ports (four inside, four outside the black box). Now, as they configured the VTL, the ports were all in initiator mode. And we needed the fourth port in target mode as well, as it’s better to have four independent paths to the VTL. The only problem was, the VE console didn’t think so.

There seemed to be no way in hell to switch the darn HBA port to target mode. Well, IBM just called and told us the solution.

Dissolve the failover group, reconfigure the HBA port and then recreate the failover group. Tada …

IBM TS7530 zoning

At first, as we prepped the zoning for the VTL, we did it WWN-based. Now the trouble with the VTL’s HBAs is simply that they have different WWPNs on the same WWN. And WWN-based zoning simply doesn’t allow access to that.

So off we went and switched to switch-port-based zoning, and lo and behold, it just works *shrug*

MessPC Ethernetbox 2 and Nagios

As I talked to Tobi yesterday, we came to talk about our Ethernet Box thermometer. It’s a neat device, which works pretty much out of the box. Integrating it with Nagios is a bit of a bummer.

Ethernetbox 2

That’s what the ~300 EUR box looks like. It’s basically a small black box with an RJ45 jack and four RJ11 jacks for attaching external devices. The box itself only functions as a “management station” and doesn’t come with a sensor.
Normally, you can attach up to four RJ11 sensors to it. But MessPC also has RJ11 port splitters, which enable you to attach up to eight RJ11 sensors to the MessPC.

Thermometer RJ45 jacks

As you can see, the box has an RJ45 jack on the other side, which you basically hook up to your network and then configure an IP address for (or, if you fancy DHCP for those things, that’s possible too).

Thermometer RJ11 jacks

On the opposite side are the RJ11 jacks for the sensors. As you can see, we currently have four splitters attached to the box, enabling up to eight sensors to be measured.
Once you have it up and running, you can look at the web interface and you’ll be able to see the state of the sensors right on the first page.

Adapter bonding on Linux

Well, today I had a rather weird error. I was testing the adapter bonding on one of the boxen designated as Tivoli Storage Manager server, when I noticed that the bonding wasn’t working as expected when simulating an error (that is, unplugging one of the TP cables for the bond).

Now, the bond had “mode=6 miimon=100” as options. After running “linux bond debug” through Google (which turned up nothing useful, besides one document on the Oracle Wiki about IOS/Linux adapter teaming), I figured “Hey, let’s just test switching the arguments.” And guess what?

Afterwards, it just works when you unplug one of the cables of the bond, while it didn’t work before … *shrug*
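One way to see what the bond actually does when a cable is pulled is to read the per-slave link state from /proc/net/bonding/bond0. A minimal sketch, assuming the usual layout of that file; the bond and interface names (bond0, eth0, eth1) and the sample text are illustrative, not output from the box in question.

```python
def parse_bond_status(text):
    """Return (active_slave, {slave_name: mii_status}) from bonding status text."""
    active = None
    slaves = {}
    current = None  # slave section we are currently inside, if any
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Currently Active Slave":
            active = value
        elif key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The first "MII Status" line (before any slave) belongs to the bond itself.
            slaves[current] = value
    return active, slaves


SAMPLE = """\
Bonding Mode: adaptive load balancing
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""

active, slaves = parse_bond_status(SAMPLE)
```

On a live system you would feed it the contents of /proc/net/bonding/bond0 before and after unplugging a cable and compare the two results.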

Windows Server 2003 Terminal services

Well, just when you think you don’t have any more problems, another one pops up. I’m currently bashing my head against the wall over why the hell the forwarded (or is it redirected?) drives are not shown in the “My Computer” explorer view. I’m pretty sure I know why (basically, HKEY_CURRENT_USER\Software\Classes isn’t writeable, but that’s where Windows, or rather Terminal Services, or whatever is creating the associations, wants to put them); I just don’t know a clever way around it.

It’s basically a dead end. The user has no access to that particular subkey, and apparently I can’t change the permissions by changing them in ntuser.dat. Neither do the inherited permissions apply, so I’m basically stuck. 🙁

Linux-HA and Tivoli Storage Manager (Finito!)

As I previously said, I was writing my own OCF resource agent for IBM’s Tivoli Storage Manager Server. And I just finished it yesterday evening (it took me about two hours to write this post).

Trac revision log (shortened)

Only took me about four work days (roughly four hours each, which weren’t recorded in that Subversion repository) plus most of this week at home (10 hours a day) and about one hundred Subversion revisions. The good part is that it actually just works 😀 (I was amazed at how well, actually). Now you’re gonna say, “But Christian, why didn’t you use the included init script and just fix it up so it is actually compliant with the LSB standard?”

The answer is rather simple: yeah, I could have done that, but you also know that wouldn’t have been fun. Life is all about learning, and learn something I did (even if I hit my head against the wall from time to time 😉 during those few days) … There are still one or two things I might want to add/change in the future (maybe next week), like

  • adding support for monitor depth by querying the dsmserv instance via dsmadmc (if you read through the resource agent, I already use it for the shutdown/pre-shutdown stuff)
  • properly testing it (like Alan Robertson mentioned in his one-and-a-half-hour talk on Linux-HA 2.0 and on his slides, pages 100-102) in a pre-production environment
  • probably configuring the IBM RSA to act as a stonith device (shoot the other node in the head), just in case one of them ever gets stuck in a state where the box is still up but doesn’t react to any requests anymore

Setting up Linux-HA

Well, initially I thought writing the OCF resource agent for Tivoli Storage Manager was the hard part. But as it turns out, it really ain’t. The hard part is getting the resources into the heartbeat agent (or whatever you wanna call it). The worst part is that hb_gui is completely worthless if you want to do a configuration without quorum.

First of all, we need to set up the main Linux-HA configuration file (/etc/ha.d/ha.cf). Configuring that is rather simple. For me, as I do have two network devices over which both nodes see each other (one is an adapter bond comprising two simple, plain, old 1G copper ports; the other is the 1G fibre cluster port), the configuration looks like this:

Once the service itself is configured, one just needs to start the heartbeat daemon on both nodes. Afterwards, we should be able to configure the cluster resources.

I find it considerably easier to just update the corresponding sections with cibadmin (the man page really has some good examples). So here are my configuration files for two resource groups (crm_mon doesn’t differentiate between resources and grouped resources; it’ll just show you that you configured two resources).

Subversion via HTTP(s) and mod_rewrite

Well, I just finished my wild-goose chase with Apache and Subversion regarding a rather weird error. I recently reinstalled our Subversion box, and ever since then I was unable to commit anything new to any of the repositories.
Subversion told me this:

Apache didn’t say much about it either, besides this particular line:

Today I sat down and thought really hard, what exactly was different from before.

  1. Installed Trac instead of Redmine, but that can’t have anything to do with the error
  2. Configured URL rewriting …

As it turns out, the following RewriteRule was the cause:

After changing the RewriteRule (as shown below, compare the difference yourself 😛), it works just like a charm.

Hint to self: whenever encountering HTTP 302 in conjunction with Subversion, check the RewriteRules ❗

Linux-HA and Tivoli Storage Manager

Well, since we received part of our shipment on Wednesday, I finally looked at how we’re gonna deploy our active/active Tivoli Storage Manager configuration. Right now, we do have a single pSeries box hosting ~100 client nodes, which we’re looking to split in two (since we do have two x366 for that purpose now).
Now, as there ain’t no solution for this scenario yet (neither from International Business Machines nor from someone in the open source community), I sat down and started writing an OCF resource agent for dsmserv (that is the Tivoli Storage Manager server).
At first I had a bit of trouble getting used to how stupid/non-standard dsmserv is, but after reading through the Storage Manager installation handbook (on multiple installations on a single server) and through some people’s notes on multiple deployments of Tivoli Storage Manager on the same server, I think I managed to get my head around it.
I still think the resource agent lacks some real testing (I put a two-node cluster online on Tuesday, but that is non-production), but that’ll happen soon.

As you can see, I reworked the “stop” phase to first terminate all running processes and then dismount all tapes in order to avoid data corruption (that was advice from our friendly IBM systems engineer); if that fails, it tries a “friendly” kill (SIGTERM); and if that ain’t helping, it kills dsmserv the “Die Hard Way”™ (SIGKILL).
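The escalation order can be sketched roughly as follows. This is not the resource agent itself (which is an OCF shell script), just an illustration of the idea in Python; `stop_with_escalation` and the `graceful_stop` callback are hypothetical names, and the callback stands in for whatever the agent does via dsmadmc.

```python
import subprocess


def _exited_within(proc, timeout):
    """Wait up to `timeout` seconds; return True if the process has exited."""
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False


def stop_with_escalation(proc, graceful_stop=None, timeout=10.0):
    """Stop `proc` (a subprocess.Popen), escalating only when a step fails.

    Returns the name of the step that finally stopped the process.
    """
    # Step 1: application-level shutdown (hypothetical callback, e.g.
    # cancel sessions, dismount tapes, then ask the server to halt).
    if graceful_stop is not None:
        graceful_stop()
        if _exited_within(proc, timeout):
            return "graceful"
    # Step 2: the "friendly" kill.
    proc.terminate()  # sends SIGTERM on POSIX
    if _exited_within(proc, timeout):
        return "SIGTERM"
    # Step 3: the process ignored SIGTERM, so force it.
    proc.kill()  # sends SIGKILL on POSIX
    proc.wait()
    return "SIGKILL"
```

The point of the ordering is that the forceful steps only run when the gentler one did not get the process to exit within the timeout, which gives the server a chance to release its tapes cleanly first.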