Nagios: Watching Clustered environments (the other way)

Well, recently I set out to monitor our cluster environments … Michael has a good howto on monitoring Windows cluster environments in the NSClient++ wiki.

Now, this has its own perks … which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE server. Combining Linux-HA with SLES10 is generally a good thing, but using startproc in that resource agent is not such a good idea.

Apparently Novell (or SuSE GmbH) thought it wise to include some additional logic in the wrappers: startproc, checkproc and killproc check for the name of the executable. So if you try to start an additional process with the same name, you need to dig a bit deeper.

For this to work, you need two additional things (quotations directly from man 8 startproc):

-p pid_file
(Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.

Now, apparently that alone isn't enough; startproc still refuses to start a second process.

-i ignore_file
The pid found in this file is used as session id of the same binary program which should be ignored by startproc.
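Putting both options together, the resource agent can start a second NRPE instance next to the system-wide one. Here's a minimal sketch (the pid file locations and the cluster-specific config path are assumptions, not from the original setup):

    # start a second NRPE daemon; -p points at the cluster instance's own pid file,
    # -i tells startproc to ignore the pid of the already-running system instance
    startproc -p /var/run/nrpe-cluster.pid \
              -i /var/run/nrpe.pid \
              /usr/sbin/nrpe -c /etc/nagios/nrpe-cluster.cfg -d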


Linux-HA: Creating a random authkey

I just looked over the slides of a presentation one of my trainees brought back from Chemnitz, and there was this nifty one-line command with which you can generate a random sha1 key for your authkeys file.

Now, since I'm a bit lazy, here's the full command line to fill /etc/ha.d/authkeys for you:
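Something along these lines does the trick (a sketch, assuming /dev/urandom as the randomness source; the one-liner from the slides may have differed in detail):

    # write an authkeys file with a random sha1 key and lock down its permissions
    ( echo -ne "auth 1\n1 sha1 "; \
      dd if=/dev/urandom bs=512 count=1 2>/dev/null | sha1sum | awk '{ print $1 }' ) \
      > /etc/ha.d/authkeys
    chmod 600 /etc/ha.d/authkeys

The chmod matters: heartbeat refuses to start if authkeys is readable by anyone but root.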

TSM client: Backing up files with umlauts on SLES

In the past, I always had problems with SLES and our Tivoli Storage Manager clients when backing up files with German umlauts. Well, today I looked a bit harder and quite quickly found a solution.

Looking at the default environment, SLES9/10 doesn't set LANG or LC_ALL (which I searched for first), but it does set LC_CTYPE.

So, simply changing LC_CTYPE in the init script and/or prepending the dsmc command line with a new LC_CTYPE fixes my umlaut problems!
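For example (a sketch; de_DE as a single-byte German locale on SLES is an assumption):

    # run the backup with a single-byte character set so umlauts survive
    LC_CTYPE=de_DE dsmc incremental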

Well, I had a longish talk with one of my trustworthy IBM senior consultants the day after writing this …

He told me something along the lines of this:

If you would like to back up files with names containing characters with a code > 127 please ensure that you have chosen a SBCS character set for your locale. The default code page C or the code page POSIX supports characters up to 127 only. Files whose names contain special characters will be skipped if C or POSIX is used. It is strongly recommended to perform a system backup by using a SBCS character set to prevent any file or directory from being skipped. This behavior for different locales is intended.

And this:

The UTF-8 locale is default on some Linux platforms. However, TSM Client currently does not support running under UTF-8 locales (such as en_US.UTF-8 and ja_JP.UTF-8). Export your LANG and LC_ALL environment variables to the iso8859-1 or EUC versions of your locale and then start a new xterm (or mlterm) session prior to running TSM Client.

That basically means that, at least for the TSM client Java interface (dsmj) and the scheduler/client acceptor daemon, you have to switch your locales to something that is _not_ UTF-8.

He also mentioned that IBM doesn't have a real solution for this problem, and that there is no real workaround either. You need to invest some time in figuring out the "right" locale setting for your system(s), since after writing the above I found that changing LC_CTYPE alone isn't enough …

You need to do the following:
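The gist, on SLES, is something along these lines (a sketch; de_DE as a single-byte German locale is an assumption, and the same exports belong in the dsmcad/scheduler init script):

    # make the whole TSM client environment single-byte, not just LC_CTYPE
    export LANG=de_DE
    export LC_ALL=de_DE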

After doing so, the scheduler and the command-line client work …

Nagios: check_snmp again

Well, today I had to rack my brain again over the way check_snmp handles WARNING and CRITICAL events. From my point of view, check_snmp is really frustrating sometimes.

As you know, the other plugins accept WARNING and CRITICAL thresholds with a simple rule: if the returned integer is above the threshold, the check goes into WARNING/CRITICAL state. But check_snmp doesn't play that way.

It expects ranges describing the values that are NOT going to result in warning or critical events. That's a bit backwards, since you have to think twice about your thresholds.
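As an example (a sketch; the host and OID are made up): to warn above 50 and go critical above 60, you specify the ranges that count as OK, and anything outside them triggers the state:

    # 0:50 and 0:60 are the OK ranges, not the alert thresholds
    check_snmp -H switch01 -o .1.3.6.1.4.1.9999.1.1 -w 0:50 -c 0:60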

All in all, another lesson learned.

Nagios: NSClient++ in a clustered Environment

Well, most of you already know that I'm a Nagios fanatic. I like to monitor as many aspects as I possibly can. So, yesterday I started figuring out ways to watch our different cluster groups (housing a bunch of file shares, more than 20,000 of them).

Now, my first tries failed horribly: I brought down a complete cluster group, resulting in a major annoyance. Today I went at it a bit smarter and cloned two VMs off my Windows Server 2003 Enterprise R2 template and created a new test cluster.

After that, I tried it on the test cluster again, with the same result: the resource is created successfully, but once I try to bring it online, it breaks and moves the whole cluster group to the other node (cycling endlessly between the cluster nodes).

After that, I figured something had to be wrong with the command I was trying to use, the one from the NSClient++ wiki. I then tried the command on the command line, but as soon as I hit <TAB> (old bash habit), it completed the path but put quotes around it … don't ask me why.

If I try the path without the quotes, no joy at all. Once you put quotes around it, everything becomes hunky-dory and the resource comes online without the slightest trouble!

Hint to self: when creating an NSClient++ cluster resource (or any application resource using a command that needs switches, for that matter), use a quoted command line along the lines of this:
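A sketch of what that looks like (the default install path is an assumption, and the actual switches are whatever the NSClient++ wiki prescribes for running it as a cluster resource):

    "C:\Program Files\NSClient++\NSClient++.exe" <switches from the NSClient++ wiki>

The quotes around the full path are the important part; without them the resource never comes online.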

VMware: New VirtualCenter 2.5 Update 4

Since many people on the VM planet have already blogged about this, I'm not gonna write just about that. Let's turn the clock back a few months, to January 2008.

Since the institution I work for is part of the DFN, we took the opportunity to be part of the "I want you to run our RA" gang. In January 2008 we thought about changing the vCenter certificate. Now, apparently there's a slight difference between what the DFN-PCA requires and what VMware considers common practice.

The DFN-PCA states that only CSRs with a key length of 2048 bits are allowed (as outlined in section 6.1.5 of the DFN-PKI Certificate Policy). VMware apparently didn't think customers would actually use this "feature" (that is, changing the SSL certificates).
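Generating such a CSR is straightforward (a sketch; the rui.key/rui.csr file names follow the VirtualCenter naming convention, and the subject is made up):

    # 2048-bit RSA key plus certificate signing request for the DFN-PCA
    openssl req -new -newkey rsa:2048 -nodes \
        -keyout rui.key -out rui.csr \
        -subj "/C=DE/O=Example Institution/CN=vcenter.example.org"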

From the VirtualCenter 2.5 Update 4 release notes:

Customization Specifications Created in Previous Releases Can Be Used in VirtualCenter 2.5 Update 4 to Clone or Deploy Virtual Machine with Customized Guest Operating Systems
This release resolves an issue where, if you clone or deploy a virtual machine using a customization specification that was created prior to upgrading the VirtualCenter, the VirtualCenter Server might display the error message "The VirtualCenter server is unable to decrypt the passwords stored in the customization specification" in the following scenarios:

  1. VirtualCenter Server is uninstalled first, and then re-installed and/or upgraded afterwards.
  2. Custom SSL certificates are deployed, but the instructions in http://www.vmware.com/pdf/vi_vcserver_certificates.pdf are not followed verbatim.

Well, and apparently it ain't fixed yet. At least not for us.

MySQL: Beware of sync_binlog on EXT3

Well, I just glanced over the my.cnf for our web cluster again, because I just moved a database from one cluster to the other and am getting quite different performance from it. And as I expected, there is a slight difference between the two configuration files:
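The culprit is sync_binlog (a sketch of the relevant my.cnf lines; the exact values on the two boxes are an assumption here):

    # old cluster: binlog flushing is left to the OS
    [mysqld]
    sync_binlog = 0

    # new cluster: every transaction fsyncs the binlog, painful on ext3
    [mysqld]
    sync_binlog = 1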

And apparently, according to the MySQL Performance Blog, that's really, really bad (all the more so since we're currently running without write caching, as the battery module of the storage is dead).

Tivoli Storage Manager Client and Microsoft Cluster Services (continued)

As you might recall from my first article on this topic, I had some trouble with the Microsoft Cluster Services and the registry replication. Today, as we tried switching the TSM server for some resources, we ran into it again.

We used the service install tool (dsmcutil install scheduler) as well as the GUI to set the new password. When we brought the resource online with the local service manager, everything was hunky-dory. But as soon as we brought it online using the Cluster Manager, it failed horribly. Why?

Well, rereading the Microsoft KB article, I remembered something about the checkpoint replication:

  • When the resource goes online, the registry keys are updated with the previously checkpointed information.
  • When the resource is brought offline, all the checkpoints associated with this resource are saved.

If you manually update these registry keys while the application or service is offline, the changes may not be replicated or may be lost. To prevent this from happening, make any manual changes while the service or application resource is online.

Simply put, when you take the resource offline, the cluster saves the registry keys from the currently owning node onto the quorum (the checkpoints). Since we changed those settings while the resource was offline, the cluster discarded them when we brought it back online with the Cluster Manager.

Simple solution: remove the registry replication parameter while the resource is offline (and click "Apply" and "OK" afterwards). After that, update the registry on the cluster node currently owning the physical disk resource (either via the GUI or dsmcutil). Then re-add the registry replication key, and you have effectively "forced" the Microsoft cluster into accepting the registry on this node as the valid one.
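On Windows Server 2003 the same dance can also be done with cluster.exe (a sketch; the resource name "TSM Scheduler" and the checkpointed key SOFTWARE\IBM\ADSM\CurrentVersion are assumptions here):

    REM drop the checkpoint while the resource is offline
    cluster res "TSM Scheduler" /removecheckpoints:"SOFTWARE\IBM\ADSM\CurrentVersion"

    REM ... now update the registry / run dsmcutil on the node owning the disk ...

    REM re-add the checkpoint before bringing the resource online again
    cluster res "TSM Scheduler" /addcheckpoints:"SOFTWARE\IBM\ADSM\CurrentVersion"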

MySQL: Replication and hostname wild cards

Yeah, yeah … I know, it's the weekend. But I can usually think much better when no one is rattling my cage, so I had another look at my replication problems. Two lessons learned right away:

  1. Never, ever change InnoDB settings when migrating between hardware, because InnoDB is rather sensitive to those parameters.
  2. When you're setting up the replication (don't ask me why) and copying over the database to the second replication partner, be aware that if you're using host name wild cards, you're gonna get seriously bitten.

Now, let's look at the setup.

[Diagram: mysql-nodes]

As you can see in the diagram above (hah, sometimes Visio is rather useful), we have two MySQL nodes, each acting as a master (as in, we're doing "normal" master-master replication).

Here’s what we’re gonna do first:

  1. Set up the user mysql_repl for mysql%.home.barfoo.org, granting REPLICATION SLAVE.
  2. Set up the user mysql_slave for mysql1.home.barfoo.org and mysql2.home.barfoo.org, also granting REPLICATION SLAVE (see the sketch right after this list).
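For reference, the grants for both steps would look roughly like this (a sketch; the password is a placeholder and an assumption):

    mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'mysql_repl'@'mysql%.home.barfoo.org' IDENTIFIED BY 'secret';"
    mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'mysql_slave'@'mysql1.home.barfoo.org' IDENTIFIED BY 'secret';"
    mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'mysql_slave'@'mysql2.home.barfoo.org' IDENTIFIED BY 'secret';"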

Afterwards, we're gonna copy the mysql database (either via tar and scp, or just via rsync over ssh) to both nodes.
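Copying it over could look something like this (a sketch; the source path, the target host and stopping mysqld on the target beforehand are assumptions):

    # stop mysqld on the target first, then push the grant tables across
    rsync -av -e ssh /var/lib/mysql/mysql/ mysql2.home.barfoo.org:/var/lib/mysql/mysql/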