NRPE – BAFM

Well, recently I stepped up to watch our cluster environments … Michael has a good howto on watching Windows cluster environments in the NSClient++ wiki.

Now, this has its own quirks, which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE server. Combining Linux-HA with SLES10 is generally a good thing, but using startproc in that resource agent is not such a good idea.

Apparently Novell (or SuSE GmbH) thought it might be wise to build some additional logic into the wrapper: startproc, checkproc and killproc all check for the name of the executable. So if you try to start an additional process with the same name, startproc refuses, and you need to dig a bit deeper.

For this to work, you need two additional things (quotations directly from man 8 startproc):

-p pid_file
(Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.

Apparently, though, this alone isn’t enough: startproc still refuses to start a second process. That is where the second option comes in:

-i ignore_file
The pid found in this file is used as session id of the same binary program which should be ignored by startproc.
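Put together, the two options might be used roughly like the sketch below. Everything concrete in it is an assumption rather than something taken from the actual resource agent: the binary path, both pid file names and the second config file are hypothetical, and passing the first instance's pid file to -i is only one reading of the man page text quoted above. The function just prints the command line instead of running it, so it can be tried on a box without startproc.

```shell
#!/bin/sh
# Hypothetical paths, adjust them to your installation.
NRPE_BIN="/usr/sbin/nrpe"
FIRST_PID="/var/run/nrpe.pid"             # pid file of the instance that is already running
SECOND_PID="/var/run/nrpe-cluster.pid"    # pid file for the additional instance
SECOND_CFG="/etc/nagios/nrpe-cluster.cfg" # assumed to set pid_file to $SECOND_PID

# Print the startproc command line for the second instance.
#   -p  alternate pid file for the process we want to start
#   -i  pid file of the same binary that startproc should ignore
# nrpe itself gets -c (config file) and -d (daemon mode).
build_start_cmd() {
    echo "startproc -p $SECOND_PID -i $FIRST_PID $NRPE_BIN -c $SECOND_CFG -d"
}

build_start_cmd
```

In a resource agent you would run the printed command rather than echo it, and hand the same -p to checkproc and killproc so that they look at the right instance.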

Well, I just noticed a really weird thing that happens when you have command line arguments enabled in NRPE.

Here’s a snippet from my nrpe.cfg:

dont_blame_nrpe=1
command[check_disk]=/usr/lib/nagios/plugins/check_disk -E -w $ARG1$ -c $ARG2$ -p $ARG3$


Now, if you check the free space for the root file system, it ain’t gonna show any inode percentage (but that one isn’t what I’m talking about). If you have to use bind mounts like I do (Tivoli needs a separate “domain”, that is, a separate mount point for each domain), you might wanna check the free space on the *real* device rather than on the bind mount (which is gonna show you the free space of the parent file system, in my case the root fs).

Let’s take a look at what I’m talking about. If you run check_disk locally like this:

./check_disk -w 20% -c 10% -p /apache/
DISK OK - free space: /apache 11090 MB (36% inode=36%);| /apache=19629MB;24575;27647;0;30719


That means everything is okay. Note that you have to pass the extra trailing slash to the --partition argument, as otherwise it would pick up the bind mount at /backup.

Now, if we do the same thing by means of NRPE, that’s gonna get you a different result. As I showed above, I have the check_disk command in my nrpe.cfg, and I also specifically enabled command arguments at compile time.

./check_nrpe -H nagios.home.barfoo.org -c check_disk -a 20% 5% /apache/
DISK CRITICAL: /apache/ not found


Now, why the hell isn’t it picking up the *original* mount point of the file system? Guess why … Because of the -E in the command, which I had added because check_disk didn’t use the original mount point but rather the bind mount in /backup. Remove the -E and it picks up the *original* mount point without any trouble *shrug*.
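To compare like with like, it helps to see the exact command line that results once the $ARGn$ placeholders are filled in, and then run that locally instead of the hand-typed variant without -E. The helper below is a minimal sketch of that substitution; expand_nrpe_args is a made-up name, and this is not how NRPE does it internally (NRPE does the replacement in C and also rejects certain shell metacharacters).

```shell
#!/bin/sh
# Replace $ARG1$, $ARG2$, ... in a command template with the given values.
# Sketch only: values containing '|', '&' or backslashes would confuse sed.
expand_nrpe_args() {
    template=$1
    shift
    n=1
    for arg in "$@"; do
        # \$ matches a literal dollar sign; '|' is the delimiter so that
        # paths with slashes can be passed as values
        template=$(printf '%s' "$template" | sed "s|\\\$ARG$n\\\$|$arg|g")
        n=$((n + 1))
    done
    printf '%s\n' "$template"
}

# The command definition from nrpe.cfg, with the arguments passed via -a
expand_nrpe_args '/usr/lib/nagios/plugins/check_disk -E -w $ARG1$ -c $ARG2$ -p $ARG3$' 20% 5% /apache/
# prints: /usr/lib/nagios/plugins/check_disk -E -w 20% -c 5% -p /apache/
```

Running the printed line on the NRPE host should give the same "not found" result as the check_nrpe call, which points at -E rather than at NRPE.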


Tag: NRPE

Nagios: Watching Clustered environments (the other way)

Suspected NRPE weirdness