April 3, 2009

Since I have the pleasure of monitoring some Windows boxen with Nagios, I took the Windows Update plugin from Michal Jankowski and deployed it. It took me some time to get nsclient++ set up correctly so that it just works, but until now the check plugin still occasionally reported the usual “Service Check Timed Out”.

Usually I ended up increasing the cscript timeout, or the nsclient++ socket timeout, but the error kept showing up. Since I rely heavily on my monitoring tools, I need them to produce as few false positives as possible. So I ended up chasing down this error today, and in hindsight it was quite simple.

In my case, it was neither cscript (that timeout is set to 300 seconds), nor nsclient++ (socket timeout is set to 300 seconds too), nor the nrpe plugin itself (that has 300 seconds as well).

As it turns out, Nagios has an additional setting controlling these things, called service_check_timeout, which defaults to 60 seconds. Sadly the plugin, or rather Windows, needs longer than those 60 seconds to figure out whether or not it needs updating, so Nagios kills the plugin and returns a CRITICAL message.

After increasing the value of service_check_timeout, that should hopefully be fixed.
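
For reference, here is a minimal shell sketch of that change. The config path /etc/nagios/nagios.cfg and the value 300 are assumptions on my side (the path differs between distributions, and 300 simply matches the other timeouts above), and the helper name set_service_check_timeout is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set service_check_timeout in a Nagios main config file.
# Usage: set_service_check_timeout <path-to-nagios.cfg> <seconds>
set_service_check_timeout() {
    cfg=$1
    seconds=$2
    if grep -q '^service_check_timeout=' "$cfg"; then
        # Directive already present: rewrite it, keeping a .bak copy of the file.
        sed -i.bak "s/^service_check_timeout=.*/service_check_timeout=$seconds/" "$cfg"
    else
        # Directive absent (Nagios then falls back to its default): append it.
        echo "service_check_timeout=$seconds" >> "$cfg"
    fi
}

# Example (path is an assumption, adjust to your install):
#   set_service_check_timeout /etc/nagios/nagios.cfg 300
#   nagios -v /etc/nagios/nagios.cfg   # verify the config before restarting
```

Nagios only reads the new value after a restart or reload, so it is worth running the config check first.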

Well, I just stumbled upon something .. My Nagios at work wasn’t working anymore, and I went looking.

nagios3 ~ [0] > tail -f /var/log/nagios/nagios.log
[1238658394] Error: Unable to save status file: No space left on device
[1238658403] Error: Unable to save status file: No space left on device
[1238658413] Error: Unable to save status file: No space left on device
[1238658423] SERVICE ALERT: tsm1;POWER WARN;OK;SOFT;4;-u OK - 0
[1238658423] Error: Unable to save status file: No space left on device
[1238658433] SERVICE ALERT: tsm2;LOAD;WARNING;SOFT;1;WARNING - load average: 6.25, 5.72, 5.36
[1238658433] Error: Unable to save status file: No space left on device
[1238658443] Error: Unable to save status file: No space left on device
[1238658453] Error: Unable to save status file: No space left on device
[1238658463] Error: Unable

After that, zip – nada. Next thing, check whether or not the device is really full … Okay, df ..

nagios3 ~ [130] > df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             3.5G  1.2G  2.1G  37% /
udev                  506M   88K  506M   1% /dev
/dev/sdb1             7.9G  7.7G     0 100% /var
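
A small check along these lines might have flagged /var before Nagios stopped writing its status file. This is only a sketch: the helper name full_filesystems and the 95 percent threshold are my own assumptions, and it relies on the fixed column layout of `df -P`, so mount points containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print every mount point
# whose usage is at or above the given percentage, followed by that percentage.
# Usage: df -P | full_filesystems 95
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5                  # capacity column, e.g. "100%"
            sub(/%/, "", use)         # strip the percent sign
            if (use + 0 >= limit) print $6, $5
        }'
}
```

Feeding it the listing above with a limit of 95 would print only the /var line.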

So, /var is actually completely filled up. Now we need to find out what’s hogging the space. Since I had an assumption (pnp4nagios), I went straight for /var/lib …

nagios3 lib [0] > du -sh *
16K     CAM
1.1M    YaST2
8.0K    acpi
4.0K    apache2
28K     autoinstall
16K     dhcpcd
4.0K    empty
96K     hardware
4.0K    logrotate.status
8.0K    misc
78M     mysql
2.1M    nagios
4.0K    net-snmp
4.0K    news
24K     nfs
8.0K    nobody
36K     ntp
4.0K    pam_devperm
824K    php5
359M    pnp4nagios
22M     rpm
28K     scpm
4.0K    smpppd
4.0K    sshd
4.0K    support
8.0K    suseRegister
4.0K    uniconf
4.0K    update-messages
4.0K    wwwrun
33M     zmd
14M     zypp
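
As an aside, scanning a listing like this by eye gets tedious. A sketch of how the same du call could be sorted by size, with the helper name biggest_entries being my own invention:

```shell
#!/bin/sh
# Hypothetical helper: list the N largest entries directly under a directory,
# biggest last. Sizes are in kilobytes (du -k) so that sort -n orders them correctly.
# Usage: biggest_entries /var/log 5
biggest_entries() {
    dir=$1
    count=${2:-10}
    # -s: one total per entry; error messages (e.g. unreadable dirs) are hidden.
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}
```

The plain -k output is used instead of -h because the human-readable suffixes do not sort numerically with a plain `sort -n`.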

That wasn’t it .. so on to the next place that’s suspicious most of the time, /var/log.

nagios3 log [0] > du -sh *
5.2G    YaST2
4.0K    acpid
1.4G    apache2
28K     boot.msg
28K     boot.omsg
4.0K    cups
4.0K    dsmerror.log
148K    dsmsched.log
4.0K    faillog
4.0K    krb5
12K     lastlog
4.0K    localmessages
16K     mail
16K     mail.info
198M    messages
0       mysqld.log
14M     nagios
0       ntp
4.0K    pnp4nagios
4.0K    sa
8.0K    scpm
4.0K    vmdesched.log
16K     vmware-imc
4.0K    vmware-tools-guestd
82M     warn
348K    wtmp
115M    zmd-backend.log
24M     zmd-messages.log

I was like “WTF ? 5.2G for YaST2 logs ?” when I initially saw that output … As of now, I have a cron job emptying /var/log/YaST2 every 24 hours …
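
Roughly what such a cleanup could look like, as a sketch rather than my exact crontab: the helper name empty_log_dir, the script path and the 03:00 schedule are all assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: delete all regular files below a log directory,
# leaving the directory tree itself in place.
# Usage: empty_log_dir /var/log/YaST2
empty_log_dir() {
    dir=$1
    # Refuse to run on an empty or non-directory argument.
    [ -n "$dir" ] && [ -d "$dir" ] || return 1
    find "$dir" -type f -exec rm -f {} +
}

# Assumed crontab entry (script path is made up), running once a day at 03:00:
#   0 3 * * * /usr/local/sbin/empty-yast2-logs.sh
```

One caveat: a log file that a running process still holds open only frees its space once that process closes it, so the numbers in df may lag behind the deletion.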

Nagios: Service Check Timed Out

SLES10: zypper.log