As I pointed out in my past posts, I was having some weird errors with SLES10 regarding mounting CD images inside the guest (and, as it turned out, on physical hardware as well).
Now, after about a week or so, Novell finally released an updated kernel package today. So the error itself is fixed: I can use my CD drives again, and I can update my virtual machines via VirtualCenter without the trickery of copying linux.iso from /vmimages/tools into the guest and mounting it with mount -o loop.
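For reference, that trickery looked roughly like this (host name, paths and mount point are just examples, not a verbatim transcript):

# copy the VMware Tools ISO from the ESX host into the guest ...
scp root@esx-host:/vmimages/tools/linux.iso /tmp/linux.iso
# ... and loop-mount it there instead of using the virtual CD drive
mkdir -p /media/cdrom
mount -o loop /tmp/linux.iso /media/cdrom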
Today I was moving a pretty standard SLES10 virtual machine to another host, when the migration dialog showed me this:
And if you now think the virtual machine is something special, take a look at these settings:
I don’t know what to think about that error message. Googling for it doesn’t reveal much. If anyone out there has an idea, I’m open to suggestions.
Since I pretty often end up in the situation where the kernel inside a VM is newer than what VMware currently ships in their Tools (as in, the SUSE kernel is newer than the binary modules built by VMware), here’s a quick reminder for myself on how to fix the .ko symlinks.
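Roughly like this; the kernel versions and module names below are placeholders for whatever uname -r and /lib/modules actually show, so adjust before copying anything:

# kernel the VMware Tools modules were originally installed for (placeholder)
OLD=2.6.16.60-0.21-smp
# kernel the VM is running right now
NEW=$(uname -r)
mkdir -p /lib/modules/$NEW/misc
# link the prebuilt Tools modules into the new kernel's module directory
# (the module list is an example; take whatever sits in $OLD/misc)
for mod in vmxnet vmhgfs vmmemctl; do
    ln -sf /lib/modules/$OLD/misc/$mod.ko /lib/modules/$NEW/misc/$mod.ko
done
# rebuild the module dependencies for the new kernel
depmod -a $NEW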
Well, after some searching (we applied VMware Update 2 today, and thus the VMware Tools update too), I finally found out what is causing that problem.
The problem doesn’t seem to be limited to virtual systems alone; I just browsed through this Novell Forum thread, which pretty much describes my problem. I found the same error in the VMs in which I tried to mount a CD image:
kernel: ide-cd: weird block size 524288
kernel: ide-cd: default to 2kb block size
The only difference between my behaviour and the one described is that the virtual machine is switched off immediately after you try to mount a CD image.
Now, this guy is saying Novell is working on it … But you have to ask the question: why in god’s name did such an update get through QA? Or is there no QA for updates at all? *shrug*
Everyone I talked to about this, including our IBM business partner and its systems engineers, as well as an IBM systems engineer (who was in fact a freelancer hired by IBM), told me it had to do with how we did the zoning (stuffing every controller into a single zone), and that this was the reason the x3650 was seeing that many drives.
When the freelance SE came to visit us, we redid the zoning, separating each endpoint connection (each HBA port to each controller port) into a different zone.
Additionally he told me that this was the only IBM™-supported configuration.
As you can see, I had to create ten different zones, one for each single port of the dual-port fibre channel HBA and its corresponding endpoint (I guess I still have to create more, since the DS4700 has *two* ports per controller).
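Just to illustrate the scheme (this is not our actual switch config; the commands assume a Brocade-style fabric and the WWPNs are made up), it boils down to one initiator and one target per zone:

# one zone per HBA port / controller port pair (single initiator, single target)
zonecreate "x3650_hba0_ds4700_ctrlA", "10:00:00:00:c9:aa:bb:01; 20:04:00:a0:b8:cc:dd:01"
zonecreate "x3650_hba0_ds4700_ctrlB", "10:00:00:00:c9:aa:bb:01; 20:05:00:a0:b8:cc:dd:01"
# add the new zones to the fabric configuration and enable it
cfgadd "fabric_cfg", "x3650_hba0_ds4700_ctrlA; x3650_hba0_ds4700_ctrlB"
cfgenable "fabric_cfg"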
After we finished that, we rebooted the x3650 and hoped that would have fixed it. Afterwards the IBM SE was baffled: still seeing ~112 devices. What the heck? He ranted about how awful this was and did some mumbo jumbo with his notebook, uploading the DS4?00 configuration files to some web interface, but shortly afterwards said the storage configuration seemed fine at first glance.
So we had another look at the storage configuration, and he quickly found that the other cluster ports were set to “Windows Cluster 2003 (Supporting DMP)” in the port configuration and said that this was why stuff still wasn’t working (I think he was guessing wildly, since he had no clue either). After I told him I just can’t change those ports right now (since the remaining part of the cluster is in full production), we agreed that I’d do it some other time and tell him about my results.
Anyway, the next day my co-workers suggested trying a newer Storage Manager version on the x3650, at the same level as the highest firmware version on the storage systems (that being the DS4700 and v09.23). Now guess what?
That fucking works. The cluster is still behaving weird sometimes (now the other boxen seem to have trouble bringing resources online, but only sometimes).
So here’s my hint: always keep an old version of the Storage Manager around, since you can’t get them from IBM anymore. *shrug*
Well, as you can read in about every damn post you can find on that topic, the /console switch is now silently ignored, as is the rdp file option “connect to console:i:1”.
Now, what you won’t find anywhere (except in some scenario explanation) is that you are allowed to specify the mode (i.e. /console previously, and now /admin) within the full address parameter.
Scenario: In the RDC client UI, you specify Computer_name /console in the Computer box (where Computer_name represents the name of the remote computer to which you want to connect), and then click Connect.
Behavior: The /console switch is silently ignored. You will be connected to a session to remotely administer the server. (For more information about the Windows Server 2008 behavior, see the “Behavior when you connect to a server that does not have Terminal Server installed” section of this article.)
So my rdp connection file basically looks like this:
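Essentially it is just the full address line with the mode tacked onto the end; the hostname here is of course only a placeholder:

full address:s:terminalserver.example.com /admin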
We’re currently having a *really* weird problem with our VMs. Sometime last week, SUSE released a kernel update. Now, once you install it and reboot the VM with a DVD/CD image present, you’re gonna see this:
The only workaround so far has been to remove *any* CD drives attached to the VM. And yes, this is reproducible; even reinstalling from scratch doesn’t change the fact that after installing the patch the VM quits working.
I also know SLES10 SP2 ain’t officially supported by VMware yet, but I’d still expect it to just work and not produce such weird errors. The only thing I found so far is this VMTN thread …
Lucky us: VMware just today released Update 2 for VirtualCenter and ESX, in which SLES10 SP2 should be officially supported!
Well, it turns out that my thought was ultimately flawed. When defining the hostgroup_members in the lower tiers, Nagios associates the checks from the lower tier with the upper tiers, thus propagating all checks upwards, and I ended up with ~250 checks instead of ~150.
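For reference, this is roughly the mechanism behind hostgroup_members (group and host names here are made up): it folds the member group’s hosts into the containing group, so any check bound to the containing group lands on those hosts too.

define hostgroup {
    hostgroup_name      infrastructure      ; upper tier (made-up name)
    hostgroup_members   linux-servers       ; pulls every host of the lower tier into this group
}

define hostgroup {
    hostgroup_name      linux-servers       ; lower tier (made-up name)
    members             web01,db01
}

define service {
    use                 generic-service     ; assumes such a service template exists
    hostgroup_name      infrastructure      ; this check now applies to web01 and db01 as well
    service_description PING
    check_command       check_ping!100.0,20%!500.0,60%
}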
Gonna have to try to define the dependency backwards, maybe that’ll help. But that’s a topic for Monday. Guess I’ll finish viewing Ghost in the Shell – Stand Alone Complex first.
As I wrote earlier, I recently virtualized our nagios. Along with that came a complete “redesign” of how checks are applied. Up till now, I defined checks for each and every single server, ending up with ~25 files, each holding roughly 6 checks, sorted by nothing but hostname.
As you can imagine, it gets quite confusing with that number of checks (~150). So I spent the last two days reorganizing (with Visio), figuring out on which object/hostgroup it would make sense to place each check. Now, this is my first result after two days of planning, reorganizing, reordering and moving hosts into different hostgroups.
Thanks to Josh (and Chris, I think), realizing the above is gonna be quite easy. I’ll talk about the config layout itself once I have it all wrapped up. Stay tuned!
As virtualization seems to be a trendy thing to do, I went ahead and virtualized our nagios (while reinstalling the whole thing …).
Now as I went into work today and started my email client, I received 4 nagios warnings about a LOAD service reaching critical state. Looked at the nagios box itself, opened up the VM console, looked into the syslog. Nothing.
Yet over three quarters of the services were flapping, and some ping checks were critical (for whatever reason). So I opened the nagios web interface again and noticed it dropping the connection over and over (I had to reauthenticate again and again).
So I opened up PuTTY, which established the connection without a single problem but dropped me like a stone after a short while. I restarted the session and got a security warning from PuTTY (the host key differed from the saved sshd public key). That raised my suspicion. So I took a look at the hostname, and lookie there.
Somehow my old nagios box (a physical one) got powered on again, and it has the same IP address as my virtualized one. So the virtualized nagios wasn’t really dropping my connection; I was simply being directed to the old nagios.
Walked over into the data center, turned off the old box (well, I kept the power button pressed for a short time), and away went my troubles.