Nagios & plugins

Since we started utilizing Nagios’s power two months ago, I finally came up with a C-based RAM plugin for Nagios. The biggest problem I had with the Python- and Perl-based plugins is that some distributions (yes, SLES and Debian) don’t install either Python or Perl by default.

Since I wanted a manageable setup (as in unified code base across all distributions), I wanted it to work without installing too much. So I took the swap plugin and basically removed what wasn’t necessary and voila!

Here we go, yay ME!
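For readers who just want the idea: the actual plugin is C and isn’t reproduced here, so the following is only a rough Python sketch of what such a check does. It assumes a Linux /proc/meminfo, counts buffers and cache as free, and the 20%/10% thresholds are made up for illustration. The exit codes 0/1/2/3 are the usual Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_meminfo(text):
    """Turn the contents of /proc/meminfo into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info


def check_ram(info, warn_pct=20.0, crit_pct=10.0):
    """Map the free-RAM percentage to (exit_code, status_line).

    The thresholds are illustrative assumptions, not the plugin's defaults.
    """
    total = info["MemTotal"]
    # Assumption: buffers and page cache count as reclaimable, hence "free".
    free = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    pct = 100.0 * free / total
    if pct < crit_pct:
        code, label = CRITICAL, "CRITICAL"
    elif pct < warn_pct:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"RAM {label} - {pct:.0f}% free ({free} kB out of {total} kB)"


def main():
    """Read /proc/meminfo, print one status line, return the exit code."""
    try:
        with open("/proc/meminfo") as handle:
            code, message = check_ram(parse_meminfo(handle.read()))
    except (OSError, KeyError, ValueError) as exc:
        code, message = UNKNOWN, f"RAM UNKNOWN - {exc}"
    print(message)
    return code
```

To use it as a plugin you would end the script with `sys.exit(main())`, since Nagios only looks at the first line of output and the exit status.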

The only thing I need to finish sometime soon is getting NSClient++ to work on my Windows boxen (of which I have quite a few: the domain controllers, the NAS cluster, …)

Thin clients

As some of you people know, we (as in the University) recently purchased some Thin Clients in order to replace some oldish computers and solve the software management problem at the same time.

The Thin Clients ain’t bad. They are Wyse V90Ls, and they (as in Wyse) use their own management software to manage and deploy those thin clients and software. The bad thing about that is that it’s using its own “Scripting Language” (if you can call it that, it’s more pseudo scripting, since you can’t do much with it besides some basic actions).

The Wyse Device Manager also introduces its own limitations. Up till now our DHCP had server options for thin clients in some other facility and thus was sending those to *all* subnets it’s acting on, thus overwriting (or disabling ?) the DHCP ACK sent out by the WDM. But that’s only one.

The second one is that the WDM seems to use (or expect) the American date format internally. But how did I stumble upon that ? As you know, I’m living and working in Germany, and we use a different time/date format than the Americans. Now, let’s assume you want to deploy packages to your Thin Clients when no one’s using them anymore (say, around midnight): you drag your package onto the devices and select “Specific Date/Time”. Now if you have the German date/time format, the WDM will simply tell you, “Deployment date can’t be less than the current system time.”

Initially I was like *WTF*, didn’t I just enter a time in the future (~15 minutes in the future from where we’re at right now) ?

First I looked through their Knowledgebase, but as nothing was documented there, I called their Support Hotline, where I briefly told ’em what was going on, and the guy immediately told me that I was hitting that error because the WDM *is* using _and_ expecting the American time/date format internally. But he told me that they’ll hopefully fix that with the next release, since they get asked about it quite often.
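I have no idea what the WDM does internally, so take this as a guess at the mechanism and nothing more: if a day-first date gets read month-first, a time that is fifteen minutes away can land months in the past. The dates and the “current time” below are invented for the example.

```python
from datetime import datetime


def parse_day_first(text):
    """German order: DD.MM.YYYY HH:MM."""
    return datetime.strptime(text, "%d.%m.%Y %H:%M")


def parse_month_first(text):
    """Same digits, read month-first, the way a US-format parser would."""
    return datetime.strptime(text, "%m.%d.%Y %H:%M")


# Hypothetical "now": 3 December 2007, 23:45.
now = datetime(2007, 12, 3, 23, 45)
entered = "04.12.2007 00:00"  # meant as 4 December, midnight

intended = parse_day_first(entered)   # 4 December 2007
misread = parse_month_first(entered)  # 12 April 2007

print(intended > now)  # True: in the future, as intended
print(misread > now)   # False: "less than the current system time"
```

With a day above 12 the month-first reading would not parse at all, which would at least fail more obviously.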

TYPO3 and MySQL replication

Apparently the TYPO3 version we are using doesn’t play too nice with MySQL Master-Master replication.

Sometimes, something like this is going to happen:

Well, as you can see from the last line in the log, the Slave-SQL thread found a duplicate entry and thought it would be smart to just turn off the thread instead of disregarding the just-made entry. So from now on both databases drift apart, since there ain’t no replication anymore until someone kick-starts the replication again (someone being me).

Anyway, I think I finally traced the fucker down: supposedly one of the problematic cases is located in t3lib/class.t3lib_tstemplate.php on line 362.

Basically what TYPO3 is doing is a DELETE and an INSERT right afterwards. But apparently, it doesn’t check whether the DELETE even succeeded. I hacked it for now, simply adding this:

Sadly, this looks more and more like a race condition between the two boxes (as in the replication / UPDATE being too slow) when users visit an edited site that hasn’t had its cache regenerated yet. Problem is, it ain’t just this single spot, but also the search indexing, the image cache and the whole page cache. For now we switched the cluster to active/passive load balancing, till we have a chance to see if a newer TYPO3 fixes those issues.
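To make the suspected race a bit more tangible, here is a toy replay of one interleaving that would end in a duplicate key. Big caveats: SQLite stands in for MySQL, the table `cache_demo` and the key are made up, and the ordering is my assumption about what happens, not something read off the binlog.

```python
import sqlite3

# One of the two masters ("B"), played by an in-memory SQLite database.
node_b = sqlite3.connect(":memory:")
node_b.execute("CREATE TABLE cache_demo (hash TEXT PRIMARY KEY, content TEXT)")


def delete(db, key):
    """The DELETE half of the DELETE + INSERT pair."""
    db.execute("DELETE FROM cache_demo WHERE hash = ?", (key,))


def insert(db, key, content):
    """The INSERT half; returns False instead of raising on a duplicate key."""
    try:
        db.execute(
            "INSERT INTO cache_demo (hash, content) VALUES (?, ?)",
            (key, content),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Both masters regenerate the same cache entry at nearly the same time.
delete(node_b, "page-42")                          # B's own DELETE
delete(node_b, "page-42")                          # A's DELETE, replayed on B
local_ok = insert(node_b, "page-42", "from B")     # B's own INSERT
replayed_ok = insert(node_b, "page-42", "from A")  # A's INSERT arrives late

print(local_ok, replayed_ok)  # True False
```

The `False` is the point where the MySQL slave SQL thread would stop with a duplicate-entry error rather than carry on.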

PacketPro 450 and SSH checks

As the guys at Teamix apparently read my recent blog post about their cluster solution, someone from their technical support called me at work on Friday 😯

And pointed out

  1. That I’m free to express my thoughts about their product (which I recently did)
  2. That there is a better way to work around this issue

He also said it’s something they have been asked multiple times. It’s as simple as editing the Virtual Server and changing the service inspection from “Connection” to “None” .. *duh*

Don’t get me wrong, the previous rant simply originated from the logs filling up within three days. I still like the PacketPro.

Bloody cluster solutions (continued)

So, as the previous try at getting the teamix people to fix the bloody LoadBalancer (as in sending at least an identification string for the SSH check) didn’t work so well (they told me I should configure MASQuerading/ROUTEing on the PacketPro, which is kinda icky), I went on today and looked at what SLES10 installs as its default logger.

Surprisingly they install a rather new syslog-ng (well, syslog-ng-1.6.8 is what they ship), so it was rather easy to work around the situation.

Here’s what already was in the syslog-ng.conf.in (more on that later):

which I just extended with the following:

Afterwards just a quick SuSEconfig --module syslog-ng, restart the syslog daemon and the messages were gone. Sure, I know it’s a rather ugly hack 😆 , but since they refused to provide a “true” fix and it seemed like that question has been asked more than once, it works for me, so *shrug* 😛

But now you’d ask why syslog-ng.conf.in ? Simply because Novell figured it would be too easy to just invent things like CONFIG_PROTECT for RPM/YaST, so they placed yet another file in there, from which the syslog-ng.conf file is generated every time SuSEconfig is executed (that’s like every time you install a package using YaST).
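The idea behind the filter, sketched in Python rather than in syslog-ng syntax: throw away the sshd complaint about a missing banner, but only when it comes from the load balancers. I’m assuming the noisy line is sshd’s “Did not receive identification string from …” message, and the addresses are placeholders from the documentation range, not our real ones.

```python
import re

# Assumption: this is the sshd message the banner-less health check triggers.
NO_BANNER = re.compile(
    r"sshd\[\d+\]: Did not receive identification string from (\S+)"
)


def drop_probe_noise(lines, probe_hosts):
    """Keep every log line except banner complaints caused by known probes."""
    kept = []
    for line in lines:
        match = NO_BANNER.search(line)
        if match and match.group(1) in probe_hosts:
            continue  # health check from a load balancer: not worth logging
        kept.append(line)
    return kept


# Hypothetical addresses of the active and the standby balancer.
balancers = {"192.0.2.10", "192.0.2.11"}

sample = [
    "web1 sshd[4711]: Did not receive identification string from 192.0.2.10",
    "web1 sshd[4712]: Did not receive identification string from 198.51.100.7",
    "web1 sshd[4713]: Accepted publickey for admin from 192.0.2.10 port 50022",
]

for line in drop_probe_noise(sample, balancers):
    print(line)
```

Matching on the source address keeps the same message visible when it comes from anywhere else, which is usually somebody port-scanning.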

Fujitsu Siemens, onboard NIC’s, Quality assurance and vendors

So we bought some Fujitsu Siemens P5916 Intel vPro back in January/February for the Boss and his secretary.

These boxes are quite nice and come with a Core 2 Duo (which is waaay too much for simple business applications like Word, Excel, Access and Outlook), but he insisted on having Windows Vista Ultimate ready PCs.

We got them, as expected, completely *blank*. Wasn’t so much of a problem though, since we have a Select 5.0 6.0 contract with M$. Only problem was, they refused to install Vista (as in freezing after prepping the HDD). So I called our local vendor, who told me “Go, grab the latest BIOS from the support page and perform a BIOS update!”, which I wasn’t too happy to hear, or to do … That didn’t work, the box would freeze on boot now …

So we reprimanded our local vendor, who pushed the liability away from themselves and onto Fujitsu Siemens Computers (since they labeled these things Vista Ready). Next thing I know, I was talking to the sales person responsible for R&D (F&L in German) in Mecklenburg-Vorpommern, who claimed “It would have been better if you bought these with Vista preinstalled – eh ?”, which I doubted (and still doubt), since drivers can’t change whether you can install Vista on a box when Vista itself considers the BIOS “not ACPI compatible” … 👿

That was about the time when I stopped listening and thought about buying Dell desktops from now on … since I’m completely sick and tired of being treated like the last low-tech moron by a) sales representatives, b) vendors, c) lvl2 technical support and d) engineering.

Anyway, I was trying to tell today’s story .. So the Boss called me in around 9, asking me to take a look at his Outlook, since it complained about “H:\Outlook.pst” not being present (H: is the drive for the roaming profiles and the private data of every employee). So I looked a bit further, into the Event log of this Vista box, where I found something like “No Logon Server found, your last locally saved profile is being reused, please contact your administrator”. From there on, I was rather, err, puzzled about the way Windows Vista is handling roaming profiles.

Opened up a command prompt, tried pinging the router in the subnet and got a garbled response from ping (which I’ve never seen before). First I checked whether the cable was OK (it was), afterwards I carried his computer back to the workroom and plugged in a separate NIC, which worked but which Vista didn’t have drivers for. So I plugged in the next one, googled for Vista drivers (which I luckily found), plugged in my pendrive and hoped they’d work with Vista .. but NOOOOOOOOOOOOOO.

So I pulled the NIC again, only to see that the model numbers differed in the second digit (I had plugged in a 500TX while I had a 530TX in my hands to look at the model number). Plugged in the NIC in my hands, did the same game again .. and Voilà, “Houston, we have lifted off … “.

Carried the PC back into his office, plugged it in, told him he could try to log in now … and finally, around 10:30-ish, he had his PC back in working condition, and at least it seemed as if he was rather happy about it 😆

SLES, ZendOptimizer and IBM PowerPC(4)+

What would you figure from the above ? Hopefully the rather obvious, that it’s a *really* shitty combination.

So we figured it would be a nice thing to test our new setup before going into pre-production testing or production, but we don’t have a spare box. So we took one of the power4 boxes we have mounted in the rack, basically consuming energy all day (that’s about 38kWh a day), and installed SLES10 onto it. Which wasn’t all that bad (at first the box repeatedly booted back into AIX or from CD, until we convinced the SMS, with a hammer, to boot from the first hard disk; SMS, or System Management Services, is basically the BIOS on the power* boxes).

The real bad part started later. First the box committed suicide sometime on the weekend (the last one that is), which is rather not so good.

So we installed the ocfs2-tools (which are obviously needed if you want to do writes on a SAN volume mounted on two separate boxes), configured the o2cb thing to start automatically on boot and added the entry to /etc/fstab.

So far so good, but as we slowly activated the apache-vhosts, we finally came to what cost me about three damned hours of my life:

Now guess what … ZendOptimizer just went bye-bye … Damn and what now ? So I looked at the Knowledgebase on zend.com, even found an Article stating it’d do that from time to time

And attached was also the usual crap .. “Please update to the latest version”. Only problem with that is that the latest version is indeed available for x86_64 (meaning amd64 in Gentoo terms), but ain’t for ppc (even if the product page states it should be).

So I went home, knowing what the problem is (it was already past 4pm anyway), swearing a short “frack that”.

Now that I’m home, have eaten something (a rather good salad), am listening to some Korn/Kid Rock/Offspring and have done some undertaker’s work, I asked myself “Why exactly do we need that crappy application anyway ?” (beyond the obvious point that the ZendOptimizer is like / is a PHP compiler cache).

It turns out, one of my co-workers wrote a TYPO3-plugin interfacing our local research database .. and the catchy thing is, guess what …

He “guarded” it with ZendGuard, thus we need to use the ZendOptimizer thingy; otherwise we couldn’t use it either … 😯

O RLY ?

Dell PowerEdge 1855, DRAC/MC, firmware updates, telnet and csr’s

Today I played a bit with our PE Chassis, or more specifically the DRAC/MC (remote management console). One of the things I’ve been experiencing was that the DRAC/MC was rather slow when browsing on the web interface (as in waiting a minute for the jnlp for the KVM to download). So I went ahead, fired up net-misc/atftp on my notebook, put the firmware update provided by Dell in the TFTPROOT and executed this in my telnet session on the DRAC/MC:

You may ask now, wtf does he use telnet for on that box ? It’s as simple as this: Dell isn’t providing anything else to use. The switches come w/ ssh, but not the management console. Only way to get ssh is to buy a new one, which is like 500 EUR.

Waited a few minutes impatiently for the DRAC/MC to come back up (and it finally did). The good thing is, the DRAC/MC is now at least a bit faster (at least I feel it’s a bit faster) and we’re up at mgmt-1.4.2.

Now, since we are a member of the DFN CA, we are able to generate signed certificates (at least Internet Explorer recognizes them through the DTAG Root certificate, which Mozilla products sadly don’t have by default). For that I need a 2048 bit PKCS#10 request (or CSR), which I tried to squeeze out of the DRAC/MC. But what the hell ❓

The DRAC/MC only gives me a 1024 bit one without the possibility to choose what kind of CSR I want to generate … 😑

miimon, arp_interval and the code

After today’s adventure with the kernel bonding, I just took a look at the code

If I read it right, you get the KERN_WARNING for “either miimon or arp_interval” only if miimon or arp_interval isn’t set … but at least my config says it is .. *shrug* .. bed time for me 🙄

Bloody cluster solutions

In preparation to get our website (and all those other websites – like www.fh-neubrandenburg.de or www.hmt-rostock.de) clustered, someone bought the cluster version of the PacketPro 450. These things are nice, especially considering you don’t need to fiddle around with LVS yourself (which is a *real* pain in the ass).

The only problem I currently have with them is that they scan the database and web nodes every 30 seconds, and since we have an active node and a hot-standby, both do this, producing this:

That’s only the logs from three minutes … now figure you have it running for like four days and figure what the average log size due to such crap is … But at least it looks solvable, though I’m gonna have to call them tomorrow and ask for a patch/update to get their ssh-scan to send some banner when performing the service check.
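For the curious, the back-of-the-envelope version in Python. The 30 seconds and the two balancers are from above; one log line per probe and roughly 100 bytes per line are guesses of mine, so treat the byte count as an order of magnitude at best.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def probe_lines(days, interval_s=30, balancers=2, lines_per_probe=1):
    """Log lines one checked service collects from the health checks.

    lines_per_probe=1 is an assumption; sshd may well log more per probe.
    """
    probes_per_balancer = days * SECONDS_PER_DAY // interval_s
    return probes_per_balancer * balancers * lines_per_probe


BYTES_PER_LINE = 100  # rough guess for a syslog line, not measured

lines = probe_lines(days=4)
print(lines)                   # 23040
print(lines * BYTES_PER_LINE)  # 2304000 bytes, a bit over 2 MB
```

And that is per checked service on each node, so it adds up across the web and database boxes.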