Linux-HA and Tivoli Storage Manager (Finito!)

As I previously said, I was writing my own OCF resource agent for IBM’s Tivoli Storage Manager Server. And I just finished it yesterday evening (it took me about two hours to write this post).

Trac revision log (shortened)

It only took me about four work days (roughly four hours each, which weren’t recorded in that Subversion repository), plus most of this week at home (at 10 hours a day) and about one hundred Subversion revisions. The good part is that it actually just works 😀 (I was amazed at how well, actually). Now you’re gonna say, “But Christian, why didn’t you use the included init script and just fix it up so it is actually compliant with the LSB standard?”

The answer is rather simple: yeah, I could have done that, but you also know that wouldn’t have been fun. Life is all about learning, and learn something I did (even if I hit my head against the wall from time to time 😉 during those few days) … There are still one or two things I might want to add/change in the future (maybe next week), like

  • adding support for monitor depth by querying the dsmserv instance via dsmadmc (if you read through the resource agent, I already use it for the shutdown/pre-shutdown stuff)
  • I still have to properly test it (like Alan Robertson mentioned in his one hour thirty talk on Linux-HA 2.0 and on his slides, Page 100-102) in a pre-production environment
  • I’ll probably configure the IBM RSA to act as a STONITH device (shoot the other node in the head), just in case one of the nodes ever gets stuck in a state where the box is still up but doesn’t react to any requests anymore


Linux-HA and Tivoli Storage Manager

Well, since we received part of our shipment on Wednesday, I finally looked at how we’re gonna deploy our active/active Tivoli Storage Manager configuration. Right now, we have a single pSeries box hosting ~100 client nodes, which we’re looking to split in two (since we now have two x366 for that purpose).
Now, as there ain’t no solution for this scenario yet (neither from International Business Machines nor from someone in the open source community), I sat down and started writing an OCF resource agent for dsmserv (that is the Tivoli Storage Manager server).
At first I had a bit of trouble adjusting to how stupid/non-standard dsmserv is, but after reading through the Storage Manager installation handbook (on multiple installations on a single server) and through some people’s notes on multiple deployments of Tivoli Storage Manager on the same server, I think I managed to get my head around it.
I still think the resource agent lacks some real testing (I put a two-node cluster online on Tuesday, but that one is not in production), but that’ll happen soon.

As you can see, I reworked the “stop” phase to first terminate all running processes and then dismount all tapes in order to avoid data corruption (that was advice from our friendly IBM systems engineer); if that fails, try terminating it with a “friendly” kill (SIGTERM); and if that ain’t helping, kill it the “Die Hard Way”™ (SIGKILL).
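The escalation order described above can be sketched roughly like this. It is only an illustration of the idea, not the code from the resource agent: the function names, the STOP_WAIT variable and its default of 5 seconds are all made up for this example.

```shell
#!/bin/sh
# Sketch only: every name in here is hypothetical, none of it is taken
# from the actual resource agent.

STOP_WAIT=${STOP_WAIT:-5}   # seconds to wait after each step (assumed value)

# Poll "$1" (a command that succeeds while the server is still running)
# once per second; return 0 as soon as it reports "stopped", 1 on timeout.
wait_until_stopped() {
    check=$1
    remaining=$STOP_WAIT
    while "$check"; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Run the given stop steps in order, from gentlest to harshest.
# Usage: escalating_stop CHECK STEP [STEP...]
escalating_stop() {
    check=$1
    shift
    "$check" || return 0          # already stopped, nothing to do
    for step in "$@"; do
        "$step" || true           # a failing step just means: escalate
        wait_until_stopped "$check" && return 0
    done
    return 1                      # still running after the last step
}
```

In an agent, the check would be whatever tells you dsmserv is still alive, the first step would be a function that talks to the server through dsmadmc, and the last two would be thin wrappers around `kill -TERM` and `kill -KILL`. Keeping the steps as separate functions means the order can be changed without touching the loop.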

zypper-update-report (was: patch2mail for SLES10)

Well, after some more refining I think I finally have a script I ain’t never gonna touch again (unless something breaks, which can happen quick as we all know).

The script now uses a sysconfig file for the common settings (like sender, recipients, categories to scan for), so it can be deployed en masse.

/etc/sysconfig/zypper-update-report

/usr/local/sbin/zypper-update-report

Debugging “rug”

Well, it’s 7pm. I’m sitting at home and thinking about why in god’s name rug isn’t adding my update repository. I can add the service using yast inst_source, but when yast then syncs with ZENworks, it tells me something like:

Failed to get repomd/repodata.xml; Reason: 530 – Access denied

So my fellow co-worker turned on the debug-logging and we quickly found out why: rug isn’t using the command line credentials I was passing.

Now I only need to find out why rug isn’t using them, and how I can pass username and password to rug … Or not: after looking through the Novell community, I found bug 204741 in Novell’s Bugzilla. Guess what? It’s marked WONTFIX (or whatever, I can’t view the duplicate bug).

Yet another VMware error

Today I was moving a pretty standard SLES10 virtual machine to another host, when the migration dialog showed me this:

fault.MemorySizeNotRecommended

And if you now think the virtual machine is something special, take a look at these settings:

Virtual machine configuration

I don’t know what to think about that error message. Googling for it doesn’t reveal much. If anyone out there has an idea, I’m open to suggestions.

Fixing vmkernel symlinks

Since I pretty often end up in the situation where the kernel inside a VM is newer than what VMware currently has in their tools (as in, the SUSE kernel is newer than the binary modules built by VMware), here’s a quick reminder for myself on how to fix the .ko symlinks.

SUSE Linux Enterprise Server 10 on VMware ESX (continued)

Well, after some searching today (we applied the VMware Update 2 today, thus the VMware Tools update too), I finally found out what is causing that problem.

Though the problem doesn’t seem to be limited to virtual systems, I just browsed through this Novell Forum thread, which pretty much describes my problem. I found the same error in the VMs where I tried to mount a CD image.

The only difference between the behaviour I’m seeing and the one described is that my virtual machine is switched off immediately after you try to mount a CD image.

Now, this guy is saying Novell is working on it … But you’re gonna have to ask the question: why in god’s name did such an update get through QA? Or ain’t there no QA for updates? *shrug*

SUSE Linux Enterprise Server 10 on VMware ESX

We’re currently having a *really* weird problem with our VMs. Sometime last week, SUSE released a kernel update. Now, once you install it and reboot the affected VM with a DVD/CD image present, you’re gonna see this:

msg.vmxaiomgr.retrycontabort.unkown

The only workaround so far has been to unmount *any* image and clear every CD drive attached to the VM. And yes, this is reproducible; even reinstalling from scratch doesn’t change the fact that the VM quits working after installing the patch.

I also know SLES10 SP2 ain’t officially supported by VMware yet, but I’d still expect it to just work and not produce such weird errors. The only thing I found so far is this VMTN thread ..

Lucky us, VMware just today released Update 2 for VirtualCenter and ESX, wherein SLES10SP2 should be officially supported!

Nagios Hostgroup Inheritance (continued)

Well, it turns out that my idea was ultimately flawed. When defining the hostgroup_members in the lower tiers, Nagios associates the checks from the lower tier with the upper tiers. That propagates all checks upwards, and I end up with ~250 checks instead of ~150.
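To illustrate the effect, here is a stripped-down fragment. The group names, the service and the check command are invented for this example and are not my real config; it only shows how, as far as I understand the directive, hostgroup_members pulls hosts from one group into another.

```
# Upper tier: every Linux box
define hostgroup {
    hostgroup_name      linux
    alias               All Linux servers
}

# Lower tier: only the SLES10 boxes
define hostgroup {
    hostgroup_name      linux-sles10
    alias               SLES10 servers
    hostgroup_members   linux      ; every host in "linux" becomes a member of "linux-sles10"
}

# Meant for SLES10 only, but now attached to every host in "linux" as well
define service {
    use                 generic-service
    hostgroup_name      linux-sles10
    service_description Pending updates
    check_command       check_updates
}
```

With the directive in the lower tier, each check bound to the lower group also lands on all hosts of the upper group, which would explain the jump in the number of checks.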

Gonna have to try to define the dependency backwards, maybe that’ll help. But that’s a topic for Monday. Guess I’ll finish viewing Ghost in the Shell – Stand Alone Complex first.

Nagios Hostgroup Inheritance

As I wrote earlier, I recently virtualized our Nagios. Along with that came a complete “redesign” of how checks are applied. Up till now, I defined checks for each and every single server, thus ending up with ~25 files (one per hostname), each holding roughly 6 checks.

As you can imagine, it gets quite confusing with that number of checks (~150). So I spent the last two days working out (with Visio) on which object/hostgroup placing a check would make sense. Now, this is the first result of two days of planning, reorganizing, reordering and moving hosts into different hostgroups.

Nagios Hostgroup Inheritance - Linux
Nagios Hostgroup Inheritance - Windows
Thanks to Josh (and Chris, I think), implementing the above is gonna be quite easy. Gonna talk about the config layout itself once I have it all wrapped up. Stay tuned!