Nagios: Watching Clustered environments (the other way)

Well, recently I stepped up to monitor our cluster environments. Michael has a good howto in the NSclient++ wiki on monitoring Windows cluster environments.

Now, this has its own perks, which I stumbled upon when trying to write a Linux-HA OCF resource agent for the Nagios NRPE server. Combining Linux-HA with SLES10 is generally a good thing, but using startproc in that resource agent is not such a good idea.

Apparently Novell (or SuSE GmbH) thought it might be wise to build some additional logic into the wrapper: startproc, checkproc and killproc all check the name of the executable. So if you try to start an additional process with the same name, you need to dig a bit deeper.

For this to work, you need two additional options (quoted directly from man 8 startproc):

-p pid_file
(Former option -f changed due to the LSB specification.) Use an alternate pid file instead of the default (/var/run/<basename>.pid). The pid read from this file is being matched against the pid of running processes that have an executable with specified path of the program. In order to avoid confusion with stale pid files, a not up-to-date pid will be ignored.

Apparently, though, this alone isn’t enough: startproc still refuses to start a second process.

-i ignore_file
The pid found in this file is used as session id of the same binary program which should be ignored by startproc.

Now, we need to construct a proper command for this:

 startproc -p /var/run/nrpe.pid -i "/var/run/nrpe*.pid" \
   /usr/bin/nrpe -c /etc/nagios/nrpe.cfg -d

This works quite fine from the command line, but just doesn’t work from within the OCF resource agent.

I spent about half an hour trying to figure out why the heck this wasn’t working from within the OCF resource agent, then gave up and did some brainstorming with my trainee.
It all boiled down to “Why exactly do I need to cluster the daemon?”

The main issue is that if you have a multiple node cluster, you do not know which one is active at any given time. If you monitor both nodes using Nagios, one will always be critical while the other is not. This is not a good situation. The cleanest solution is to cluster the NSClient++ daemon, so that it is always running on the active node.

That is only partly true. Usually each cluster group has at least one dedicated IP address, and both NRPE and NSclient++ listen on all available IP addresses.
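If you want to confirm that the daemon really binds to every address on a node, a small helper like the one below can read the listening-socket table. This is only a sketch: the function name listens_on_all is made up for illustration, and it assumes the usual netstat -tln or ss -tln output format and NRPE's default port 5666.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the listening-socket table on stdin
# shows a wildcard bind (0.0.0.0, *, [::] or ::) on the given TCP port.
listens_on_all() {
    port="$1"
    grep -Eq '(^|[[:space:]])(0\.0\.0\.0|\*|\[::\]|::):'"$port"'([[:space:]]|$)'
}

# Usage (5666 is NRPE's default port; adjust if yours differs):
#   netstat -tln | listens_on_all 5666 && echo "NRPE listens on all addresses"
```

If the check fails, the daemon is probably bound to a single address in its configuration, and the cluster IP would not be answered.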

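That means that if you run NRPE or NSclient++ on every cluster node, you don’t need to cluster the daemon at all: point the Nagios check at the cluster group’s IP address, and whichever node currently holds that address will answer. That’s one huge hassle you don’t need to worry about anymore.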
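From the Nagios server, querying through the cluster address could look roughly like this. Treat it as a sketch under assumptions: the IP 192.0.2.10 is a documentation placeholder, the plugin path varies by distribution, the wrapper name check_cluster_service is made up, and check_load stands in for whatever command your nrpe.cfg actually defines.

```shell
#!/bin/sh
# Assumed values, override them from the environment to match your setup.
CLUSTER_IP="${CLUSTER_IP:-192.0.2.10}"   # service IP of the cluster group (placeholder)
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"   # path differs per distro

# Hypothetical wrapper: run an NRPE command against the cluster IP.
# Whichever node currently holds the address answers the request, so the
# caller never needs to know which node is active.
check_cluster_service() {
    "$CHECK_NRPE" -H "$CLUSTER_IP" -c "$1"
}

# Usage:
#   check_cluster_service check_load
```

Checks for the physical nodes themselves (disk, load, and so on) can keep using each node’s own address, while checks for clustered services go through the cluster IP.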
