A week ago (September 2nd), I received an e-mail announcing the release of IBM’s new multipathing device driver for the DS4x00 series, which finally works with SLES11 (the version available up till now doesn’t; it fails with kernels > 2.6.26, IIRC).
There wouldn’t be any trouble if IBM (or rather LSI, the vendor actually providing the driver) would actually release it … as of today, I have yet to see the new version appear on the download page. I already tried to notify IBM about this, but as usual there’s no real way to get it to the right person.
Well, IBM just replied to my feedback, and apparently the download is available (it is right now, two weeks later, hah … finally).
After some more tinkering, and a lot more staring at the macros in /usr/lib/rpm/rpm-suse-kernel-module-subpackage and /usr/lib/rpm/suse_macros, I think I finally have a usable RPM’ified version of IBM’s multipathing driver ready for use.
There is still one major annoyance left: each time you install a new ibm-rdac-ds4000-kmp RPM, you also need to reinstall the corresponding ibm-rdac-ds4000-initrd package, as the macros in /usr/lib/rpm don’t allow for a custom %post or %postun.
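For illustration, this is roughly the kind of %post the initrd package carries (a rough sketch only; the mppUpdate helper path is an assumption on my part, adjust it to wherever the driver installs the script on your box):

%post
# Sketch: rebuild the MPP initrd whenever this package gets (re)installed.
# The path to mppUpdate is an assumption; adjust as needed.
if [ -x /usr/sbin/mppUpdate ]; then
    /usr/sbin/mppUpdate
fi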
As mentioned before, I’m gonna send them to LSI/IBM for review, and maybe, MAYBE they are actually gonna make use of that.
Without further delay, here’s the list of packages. Just a short explanation: you need mppUtil-%version in order to install the ibm-rdac-ds4000-kmp.
This package should be usable with System Storage DS4000 as well as System Storage DS3000 (they use the exact same source code).
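Expressed in spec syntax, that dependency is nothing more than a line along these lines in the kernel module package (where exactly it ends up, main preamble or the preamble file fed to the SUSE KMP macros, depends on how the macros are invoked, so take the placement with a grain of salt):

# Sketch: tie the kernel module package to the matching mppUtil build.
Requires:       mppUtil = %{version}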
I also know that this solution isn’t really perfect. I’ve been looking at the %triggerin/%triggerun macros, but right now I can’t draw up an (easy) scenario to successfully use triggers in this situation. The only idea I’ve come up with looks like this:
Put the triggers into ibm-rdac-ds4000
When installing the kernel module packages, write the kernel version/flavor into a temporary file (impossible, since the macros don’t let you influence %post), and then let the trigger create/update the MPP initrd (roughly as sketched below)
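To make that idea a bit more concrete, here is a rough sketch of such a trigger in ibm-rdac-ds4000; the KMP subpackage names, the missing flag file and the mppUpdate call are all assumptions on my part, not tested code:

%triggerin -- ibm-rdac-ds4000-kmp-default, ibm-rdac-ds4000-kmp-smp
# Sketch: fires whenever one of the KMP subpackages gets (re)installed.
# Ideally the KMP %post would have written the kernel version/flavor into a
# flag file we could read here; since it can't, this simply rebuilds the MPP
# initrd for the running kernel via the driver's helper script.
if [ -x /usr/sbin/mppUpdate ]; then
    /usr/sbin/mppUpdate
fi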
If anyone knows a better solution (as in easier, without writing to a separate file), I’m all ears.
After a bit of tinkering, I actually got it working. I was kinda surprised at how easy it actually is. One problem I still have to deal with is modifying the %post to generate the mpp initrd image. For now, the KMP only contains the default %post, which updates the modules.* stuff.
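From memory, the generated default %post boils down to something like the snippet below (take it as an approximation of what the SUSE macros emit, with %{kver} as a placeholder); the mpp initrd generation would have to be tacked on after the depmod call:

%post
# Approximation of the default scriptlet: refresh the module dependency
# information (modules.dep & friends) for the targeted kernel.
/sbin/depmod -a -F /boot/System.map-%{kver} %{kver} || :
# Still missing here: rebuilding the mpp initrd, e.g. via mppUpdate.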
Name:           ibm-rdac
License:        GPL
Group:          System/Kernel
Summary:        IBM Multipathing driver for DS4000 disk subsystems
…
Now, I’m kinda asking myself why more vendors don’t submit their drivers to Novell in the form of KMPs … Anyway, I’m gonna send mine the LSI/IBM way, maybe they’ll pick it up …
Well, kernel updates on our Linux servers running IBM’s RDAC driver (developed by LSI) are a real pest … especially if you have to reboot the box twice in order to get the drivers/initrd installed correctly.
So I sat down and looked at the Makefile. Turns out it just needs four tweaks in order to work with a different kernel version (which you then pass to make on the command line).
After that, a simple make KERNEL_OBJ=/lib/modules/2.6.16.60-0.37_f594963d-smp/build OS_VER=2.6.16.60-0.37_f594963d-smp install correctly installs the modules in /lib/modules, rebuilds the module dependencies and builds the correct initrd image.
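If you have more than one kernel flavor installed, a small loop over /lib/modules saves some typing; this is just a sketch of how I drive the command above, and /usr/src/linuxrdac is a placeholder for wherever you unpacked the RDAC source:

#!/bin/sh
# Sketch: rebuild and install the RDAC (MPP) driver for every installed kernel.
# /usr/src/linuxrdac is a placeholder; point it at your unpacked driver source.
cd /usr/src/linuxrdac || exit 1

for build in /lib/modules/*/build; do
    kver=$(basename "$(dirname "$build")")   # e.g. 2.6.16.60-0.37_f594963d-smp
    make clean
    make KERNEL_OBJ="$build" OS_VER="$kver" install
done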
Everyone I talked to about this, including our IBM business partner and its systems engineers, as well as an IBM systems engineer (who was in fact a freelancer hired by IBM), told me it had to do with how we did the zoning (stuffing every controller into a single zone), and that this was the reason the x3650 was seeing that many drives.
When the freelance SE came to visit us, we redid the zoning, separating each endpoint connection (each HBA port to each controller port) into a different zone.
Additionally, he told me that this was the only IBM™-supported configuration.
As you can see, I had to create ten different zones, one for each single port of the dual-port fibre channel HBAs and the corresponding endpoint (I guess I still have to create more, since the DS4700 has *two* ports per controller).
After we finished that, we rebooted the x3650 and hoped that would have fixed it. Afterwards, the IBM SE was baffled: still seeing ~112 devices. What the heck? He ranted about how awful this was, did some mumbo jumbo with his notebook and uploaded the DS4?00 configuration files to some web interface, but shortly afterwards said the storage configuration seemed to be fine at first glance.
So we had another look at the storage configuration, and he quickly found that the other cluster ports were set to “Windows Cluster 2003 (Supporting DMP)” in the port configuration and said that’d be the reason why stuff still ain’t working (I think he was guessing wildly, since he had no clue either). After I told him I just can’t change those ports right now (since the remaining part of the cluster is in full production), we agreed that I’d do it some other time and tell him about my results.
Anyways, the next day my co-workers suggested trying a newer Storage Manager version on the x3650, at the same level as the highest firmware version on the storage arrays (that being the DS4700 and v09.23). Now guess what?
That fucking works. The cluster is still behaving weird sometimes (now the other boxen seem to have trouble bringing resources online, but only sometimes).
So here’s my hint: always keep an old version of the Storage Manager around, since you can’t get them from IBM anymore *shrug*
Okay, so we received a brand new x3650 the other day, intended to replace one (or rather two) of our NAS frontend servers. We installed Windows on it (had to create a custom Windows Server 2003 CD first, since the default one doesn’t recognize the integrated ServeRAID), and we prepped the box during the week with the usual things.
On Monday I started installing the “IBM StorageManager RDAC” multipath driver (since the box got two single-port PCIe FC HBAs) and figured it’d be nice if we had this. I asked an IBM systems engineer at one of our partners, who told me that generally there wouldn’t be a problem with Microsoft Cluster Services (MSCS) and the IBM MPIO driver. The only requirement would be that I install the new storport.sys driver (version 5.2.3790.4021) first (as per Microsoft KB932755).
Now, yesterday I finished the zoning, did the mappings on the storage arrays and then figured the box should see the hard disks. So I started adding another node to our existing Microsoft Cluster.
Result: Zip (as in MSCS telling me not all nodes could see the quorum disk)
Reason: a combination of two things. First, said IBM Storage Manager RDAC. The first time I installed it, I had forgotten about the storage mappings, thus the box was seeing zero disks. After uninstalling it, I was seeing 121 (that’s right, one hundred and twenty-one) new devices.
That is basically a result of the zoning I did for this particular device, which has *all* controllers present in a single SAN zone, thus the HBAs see each device eight (or nine) times … Update: yes, I’m missing one controller … 😀
Now that I reinstalled the RDAC *after* the host had discovered the volumes, it’s showing only a dozen drives.
Now that I had figured this out, I told myself “Hey, adding the third node to the Windows Cluster should now work without a hitch …” … guess what?
It’s Microsoft, and it doesn’t. Now why doesn’t it work? ’Cause the Cluster Setup Wizard gets confused in Typical mode, as it creates a “local quorum disk” which naturally isn’t present in the cluster it’s joining. Switching the wizard to “Advanced (minimum) configuration”, as suggested in Q331801, just works … *shrug*