Microsoft Cluster on VMware and Devices

Well, once again the Microsoft Cluster on VMware bit my ass … As you might know, MSCS on VMware is a particular kind of pain: with each upgrade you end up with the same problems over and over again (SCSI reservations on the RDM LUNs being one, the passive node not booting being the other).

So I opened up another support case with VMware, and they responded like this:

Please see this kb entry: http://kb.vmware.com/kb/1016106

This doesn’t completely fit my case, but since the only active cluster node failed yesterday evening (it’s only our internal file-share server, so no big deal), I figured I’d try setting the options anyway.

And guess what? My damn cluster works again 🙂

Nagios: Watching Clustered environments (the other way)

Well, recently I set out to monitor our cluster environments … Michael has a good howto on watching Windows cluster environments in the NSClient++ wiki.

Now, this has its own quirks … which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE daemon. Combining Linux-HA with SLES 10 is generally a good thing, but using startproc in that resource agent is not such a good idea.

Apparently Novell (or SuSE GmbH) thought it might be wise to include some additional logic in the wrapper: startproc, checkproc and killproc all check for the name of the executable. So if you try to start a second process with the same name, you need to dig a bit deeper.

For this to work, you need two additional options (quoted directly from man 8 startproc):

-p pid_file
(Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.

Now, apparently that alone isn’t enough; startproc still refuses to start a second process.

-i ignore_file
The pid found in this file is used as session id of the same binary program which should be ignored by startproc.
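
Putting the two together, the start action of such a resource agent ends up looking roughly like the sketch below. This is only a minimal sketch: the binary path, the separate nrpe config and the pid file names are assumptions for illustration, not taken from any shipped agent.

    #!/bin/sh
    # Sketch of the "start" portion of a hypothetical NRPE OCF resource agent
    # on SLES 10. All file names below are assumptions for illustration; the
    # clustered config is assumed to set pid_file=/var/run/nrpe-cluster.pid.
    NRPE_BIN=/usr/sbin/nrpe                 # the NRPE daemon binary
    NRPE_CONF=/etc/nagios/nrpe-cluster.cfg  # separate config for the clustered instance
    PIDFILE=/var/run/nrpe-cluster.pid       # alternate pid file (-p)
    IGNOREFILE=/var/run/nrpe.pid            # file holding the pid of the node-local nrpe (-i)

    # -p makes startproc track the clustered instance via its own pid file,
    # -i tells it to ignore the already running node-local nrpe with the same
    # executable name, so it actually starts a second process.
    startproc -p "$PIDFILE" -i "$IGNOREFILE" "$NRPE_BIN" -c "$NRPE_CONF" -d

With that, startproc no longer bails out just because another nrpe is already running for the node itself.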


Tivoli Storage Manager Client and Microsoft Cluster Services

Well, I just had another look at our client scheduler services on our Microsoft Cluster. A while back we noticed that those scheduler services were going nuts after some time. Well, as it turns out, I can now tell why. Microsoft Cluster Services have a feature called registry replication, which replicates a given registry key to all connected cluster nodes if it changes while the resource is online.

Now, we added the obvious registry key to the settings of our cluster resources for the scheduler services (SOFTWARE\IBM\ADSM\CurrentVersion\BackupClient\Nodes\<TSM NODE NAME>), assuming the scheduler service would use that same registry key to store its passwords. But it seems we were way off with that assumption.

The scheduler service uses another registry key; it’s quite similar to the one the GUI uses, but it’s different enough (SOFTWARE\IBM\ADSM\CurrentVersion\Nodes\<TSM NODE NAME>).
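
For reference, the registry checkpoint can also be added from the command line instead of the resource’s properties dialog. A minimal sketch using cluster.exe, with a made-up resource name and TSM node name (adjust both, and double-check the switch against cluster res /? on your nodes):

    rem Hypothetical names: "TSM Scheduler NODE01" as the scheduler service
    rem resource and NODE01 as the TSM node name.
    rem This checkpoints the key the scheduler service actually uses, so MSCS
    rem replicates it to the other nodes whenever it changes while online.
    cluster res "TSM Scheduler NODE01" /addcheckpoints:"SOFTWARE\IBM\ADSM\CurrentVersion\Nodes\NODE01"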

Microsoft Cluster Services powered by IBM

If you think back, I talked about my problems with MSCS while utilizing the IBM RDAC Multipath driver for Windows.

Everyone I talked to about this, including our IBM business partner and its systems engineers, as well as an IBM systems engineer (who in fact was a freelancer hired by IBM), told me it had to do with how we did the zoning (stuffing every controller into a single zone), and that this was the reason the x3650 was seeing that many drives.

When the freelance SE came to visit us, we redid the zoning, separating each endpoint connection (each HBA port to each controller port) into a different zone.

Additionally, he told me that this was the only IBM™-supported configuration.

[Image: SAN Zoning (Overview)]

As you can see, I had to create ten different zones, one for each single port of the dual-port Fibre Channel HBA and its corresponding endpoint (I guess I still have to create more, since the DS4700 has *two* ports per controller).

[Image: SAN Zoning (Detailed)]

After we finished that, we rebooted the x3650 and hoped that would have fixed it. Afterwards the IBM SE was baffled: still seeing ~112 devices. What the heck? He ranted about how awful this was, did some mumbo jumbo with his notebook, uploaded the DS4?00 configuration files to some web interface, but shortly afterwards said the storage configuration seemed fine at first glance.

So we had another look at the storage configuration, and he quickly found that the other cluster ports were set to “Windows Cluster 2003 (Supporting DMP)” in the port configuration, and said that’d be the reason why stuff still ain’t working (I think he was guessing wildly, since he had no clue either). After I told him I just can’t change those ports right now (since the remaining part of the cluster is in full production), we agreed that I’d do it some other time and let him know about my results.

Anyways, the next day my co-workers suggested trying a newer Storage Manager version on the x3650, at the same level as the highest firmware version among our storage arrays (that being the DS4700 at v09.23). Now guess what?

That fucking works. The cluster is still behaving weird sometimes (now the other boxen seem to have trouble bringing resources online, but only sometimes).

So here’s my hint: always keep old versions of the Storage Manager around; you can’t get them from IBM anymore *shrug*

Windows Cluster Service (continued)

Well, guess my “solution” didn’t work sooo well after all. Lemme tell you what’s happening: I successfully added the node to the cluster, but I can’t get *any* resources online on it.

The node tries bringing a resource online, then shows a failure and immediately moves it over to the next node, where the resource comes online just fine … So again, I’m out of ideas …

I already tried reinstalling the box; after that I could get the third node into the cluster successfully, even without the “Advanced (minimum)” trick … *shrug* it still ain’t bringing any resources online.

IBM RDAC and Windows Cluster Service

Okay, so we received a brand new x3650 the other day, intended to replace one (or rather two) of our NAS frontend servers. We installed Windows on it (had to create a custom Windows Server 2003 CD first, since the default one doesn’t recognize the integrated ServeRAID), and we prepped the box during the week with the usual things.

On Monday I started installing the “IBM Storage Manager RDAC” multipath driver (since the box has two single-port PCIe FC HBAs) and figured it’d be nice if we had this. I asked an IBM systems engineer at one of our partners, who told me there generally wouldn’t be a problem with Microsoft Cluster Services (MSCS) and the IBM MPIO driver. The only requirement was that I install the new storport.sys driver (version 5.2.3790.4021) first (as described in Microsoft KB932755).

Now, yesterday I finished the zoning, did the mappings on the storage arrays and then figured the box should see the hard disks. So I started adding another node to our existing Microsoft Cluster.

Result: Zip (as in MSCS telling me not all nodes could see the quorum disk)

Reason: a combination of two things. First, said IBM Storage Manager RDAC. The first time I installed it, I had forgotten about the storage mappings, so the box saw zero disks. After uninstalling it, I was seeing 121 (that’s right, one hundred and twenty-one) new devices.

[Image: Visible volumes previous to installing the RDAC driver]

That is basically a result of the zoning I did for this particular box, which has *all* controllers present in a single SAN zone, so the HBAs see every device eight (or nine) times; with roughly a dozen LUNs mapped, that quickly adds up to the ~120 device entries from above. Update: yes, I’m missing one controller … 😀

[Image: SAN zoning for the box]

Now, since I reinstalled the RDAC *after* the host had discovered the volumes, it’s showing only a dozen drives.

[Image: Visible volumes after installing the RDAC driver]

Now, having figured this out, I told myself “Hey, adding the third node to the Windows cluster should now work without a hitch …” … guess what?

It’s Microsoft, and it doesn’t. Now why doesn’t it work? ‘Cause the Cluster Setup Wizard gets confused in Typical mode, as it creates a “local quorum disk” which naturally isn’t present in the cluster it’s joining. Switching the wizard to “Advanced (minimum) configuration”, as suggested in Q331801, just works … *shrug*