Nagios: Service Check Timed Out

Since I got the pleasure of watching some Windows boxen with Nagios, I took the Windows Update plugin from Michal Jankowski and implemented it. It took me some time, to initially set up the nsclient++ correctly so it just works, but up till now the check plugin sometimes reported the usual “Service Check Timed Out”.

Usually I ended up increasing the cscript timeout, or the nsclient++ socket timeout, but it still kept showing up. Since I rely heavily on my surveillance tools, I have the demand, that as few as possible false positives show up. So I ended up chasing down this error today, and after that I have to say it was quite simple.

In my case, it wasn’t cscript (that timeout is set to 300 seconds), neither nsclient++ (socket timeout is set to 300 seconds too), nor the nrpe plugin itself (that has 300 seconds as well).

As it turns out, Nagios got an additional setting controlling these things, called service_check_timeout which defaults to 60 seconds. Sadly the plugin, or rather Windows needs longer than those 60 seconds to figure out whether or not it needs updating, thus Nagios is killing the plugin and returning a CRITICAL message.

After increasing the value of service_check_timeout that’ll be fixed hopefully.

SLES10: zypper.log

Well, I just stumbled upon something .. My Nagios at work wasn’t working anymore, and I went looking.

After that, zip – nada. Next thing, check whether or not the device is really full … Okay, df ..

So, it is actually completely filled up. So, now we need to find who’s hogging the space. Since I had a assumption (pnp4nagios), I went straight for /var/lib …

That wasn’t it .. so heading to the next place, that’s suspicious most of the time, /var/log.

I was like “WTF ? 5.2G for YaST2 logs ?” when I initially saw that output … As of now, I got a crontab emptying /var/log/YaST2 every 24 hours …