Relaxing weekend

There I was, enjoying our new patio and trying to get a tan outside (we had plenty great weather during the weekend).

The new patio
The new patio

Now, all of the sudden there was this *weird* noise I couldn’t classify (which I heard while listening to my iPod completely cranked up …), so I stood up.

Combine harvester mowing down rape
Combine harvester mowing down rape

They were finally mowing the rape field behind our house … hah!

SUSE Linux Enterprise Server 10 on VMware ESX (continued)

Well, after some searching today (we applied the VMware Update 2 today, thus the VMware Tools update too), I finally found out what is causing that problem.

Though the problem seems to be not limited to virtual systems alone, I just browsed through this Novell Forum thread which pretty much describes my problem. I found the same error in the VM’s I tried to mount a CD image.

Only difference between my behaviour and the one described, is that the virtual maschine is switched off immediately after you try to mount a CD image.

Now, this guy is saying Novell is working on it … But you’re gonna have to ask the question, why in gods name did such an update get through QA ? Or ain’t there no QA for updates ? *shrug*

Microsoft Cluster Services powered by IBM

If you think back, I talked about my problems with MSCS while utilizing the IBM RDAC Multipath driver for Windows.

Everyone I talked to about this, including our IBM business partner and it’s systems engineers; as well as some IBM systems engineer (who in fact was an freelance guy hired by IBM), told me it had to do with how we did the zoning (stuffing every controller into a single zone), and that would be the reason why the x3650 was seeing that many drives.

When the freelance SE came to visit us, we redid the zoning, separating each endpoint connection (each HBA port to each controller port) into a different zone.

Additionally he told me, that was the only IBM™ supported configuration.

SAN Zoning (Overview)
SAN Zoning (Overview)

As you can see, I had to create ten different zones for each single port of the dual port fibre channel HBA and it’s corresponding endpoint (I guess, I still have to create more, since the DS4700 is having *two* ports per controller).

SAN Zoning (Detailed)
SAN Zoning (Detailed)

After we finished that, we rebooted the x3650 and hoped that would have fixed. Afterwards the IBM SE was baffled. Still seeing ~112 devices. What the heck ? He ranted about how awful this was and did some mumbo jumbo with his notebook, uploaded the ds4?00 configuration files to some web interface, but shortly afterwards said the storage configuration seemed to be fine on the first glance.

So we had another look at the storage configuration and he quickly found, that the other cluster ports were set to “Windows Cluster 2003 (Supporting DMP)” in the port configuration and said that’d be the cause why stuff still ain’t working (I think he guessed wildly, since he had no clue either). After I told him, I just can’t change those ports right now (since the remaining part of the cluster is in full production), we agreed that I’d do it some other time and tell him about my results.

Anyways, the next day my co-workers suggested, trying a newer Storage Manager version on the x3650, at the same level with the highest firmware version on the storages (thus being the DS4700 and v09.23). Now guess what ?

That fucking works. The cluster is still behaving weird sometimes (now the other boxen seem to have trouble bringing resources online, but only sometimes).

So here my hint: Always keep an old version of the Storage Manager around, you can’t get them from IBM anymore *shrug*

Cascading Style Sheets are really weird

So here I was, sitting around and thinking about formatted classes for my paragraphs. Now the result is quite pleasing, but has some side effects. But see for yourself …

Messed up CSS
Messed up CSS

As you can see, the browser is reusing the background-image URL from the <p> element within the <a> element, even though the <a> element initially had none. Even defining putting an background-image: none; into the <a> class doesn’t get me anywhere.

The weird thing is only Firefox is displaying it this messed up (not so weird when you think about how IE 6 treats standards). So if any of you CSS wiz’ got a suggestion, I’m listening 🙂

Thanks to the tip of Dave(?), the issue is fixed now!

Connecting to a remote console with MSTSC 6.0.6001

Well, as one can read in about every damn post you can find for that topic, the /console switch is now silently ignored, as well as the rdp file option “connect to console:i:1“.

Now, what you don’t find anywhere (only in some scenario explanation), that it is allowed to specify the mode (ie /console previously and now /admin) within the full address parameter.

Scenario: In the RDC client UI, you specify Computer_name /console in the Computer box (where Computer_name represents the name of the remote computer to which you want to connect), and then click Connect.

Behavior: The /console switch is silently ignored. You will be connected to a session to remotely administer the server. (For more information about the Windows Server 2008 behavior, see the “Behavior when you connect to a server that does not have Terminal Server installed” section of this article.)

So my rdp connection file basically looks like this:

SUSE Linux Enterprise Server 10 on VMware ESX

We’re currently having a *really* weird problem with our VM’s. Sometime last week, SUSE released a kernel update. Now, once you install it and you reboot the selected VM with a DVD/CD image present, you’re gonna see this:

msg.vmxaiomgr.retrycontabort.unkown
msg.vmxaiomgr.retrycontabort.unkown

The only workaround so far has been to unmount *any* cleanse any CD-Drives attached to the VM. And yes, this is reproduceable, even reinstalling from scratch doesn’t change the fact, that after installing the patch the VM quits working.

I also know, SLES10 SP2 ain’t officially supported yet by VMware, but I’d still suspect it to just work and not produce such weird errors. The only thing I found so far is this VMTN thread ..

Lucky us, VMware just today released Update 2 for VirtualCenter and ESX, wherein SLES10SP2 should be officially supported!

Nagios Hostgroup Inheritance (continued)

Well, it turns out that my thought was ultimativly flawed. When defining the hostgroup_members in the lower tiers, nagios is association the checks from the lower tier with the upper tiers. Thus propagandating all checks upwards, and me ending up with ~250 checks instead of ~150.

Gonna have to try to define the dependency backwards, maybe that’ll help. But that’s a topic for Monday. Guess I’ll finish viewing Ghost in the Shell – Stand Alone Complex first.

Nagios Hostgroup Inheritance

As I wrote earlier, I recently virtualized our nagios. Along with that came a complete “redesign” of how checks are applied. Up till now, I defined checks for each and every single server, thus ending up with ~25 files, each holding roughly 6 checks which are in the same file just sorted by hostname.

As you can imagine, it gets quite confusing with that amount of checks (~150). So the last two days I spent on reorganizing (with Visio), on which object/hostgroup placing a check would make sense. Now, this is my first result of two days planning, reorganizing, reordering and moving hosts into different hostgroups.

Nagios Hostgroup Inheritance - Linux
Nagios Hostgroup Inheritance – Linux
Nagios Hostgroup Inheritance - Windows
Nagios Hostgroup Inheritance – Windows
Thanks to Josh (and Chris I think), realizing the above is gonna get quite easy. Gonna talk about the config layout itself about once I have it all wrapped up. Stay tuned!

Nagios virtualization

As virtualization seems to be a trendy thing to do, I went ahead and virtualized our nagios (while reinstalling the whole thing …).

Now as I went into work today and started my email client, I received 4 nagios warnings about a LOAD service reaching critical state. Looked at the nagios box itself, opened up the VM console, looked into the syslog. Nothing.

Yet over 3/4 of the services were flapping, some ping checks were critical (for whatever reason). So I opened the nagios webinterface again, and noticed it dropping the connection over and over again (had to reauthentificate me again and again).

So I opened up Putty, which established the connection without a single problem, but dropped me like a stone after a short amount of time. I restarted the session and got a security warning from Putty (due to different than the saved sshd public key). That raised my suspicion. So I took a look at the hostname, and lookie there.

Somehow my old nagios box (which is a physical box), got turned online again, thus having the same IP address as my virtualized one. So the virtualized nagios wasn’t really dropping my connection, but I was being directed to the old nagios.

Walked over into the data center, turned of the old box (well, I kept the power button pressed for a short time), and away went my troubles.

Extending VMotion compatiblity (continued)

Remember my last post about cpu masking ? Well, turns out that you can do it to a “template”.

The only point you don’t need to do, is to mark the VM as a “template“. You still can clone it and move it around and all that other stuff, but the good part is, that the cloned VM keeps the cpu mask set to the “template*shrug*

I don’t know, why VMware didn’t include that feature into the templates, since it’s a real freaky way to do.