February 2009

Nagios: check_snmp again

February 27, 2009 (updated June 21, 2013) by Christian

Well, today I had to rack my brain again over the way check_snmp handles WARNING and CRITICAL thresholds. From my point of view, check_snmp is just plain counterintuitive sometimes.

As you know, most other plugins treat the WARNING and CRITICAL thresholds as simple limits: if the returned integer is above the threshold, the service goes into the WARNING or CRITICAL state. But check_snmp doesn’t play that way.

It expects ranges, and a value inside the range is NOT gonna result in a warning or critical event. Which is kinda stupid, since you have to think twice about the thresholds 😛

define service {
  use                   generic-service
  host_name             ibm-bc1-mgmt
  service_description   Chassis Cooling - Bay 1
  check_command         check_snmpv1_public!.1.3.6.1.4.1.2.3.51.2.2.3.20.0!1900:8000!1900:0,10000:8000!RPM!Chassis Cooling - Bay 1
  action_url            /pnp/index.php?host=$HOSTNAME$&srv=$SERVICEDESC$
  notes                 View PNP RRD grap
}
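
To make the range logic a bit more tangible, here is a minimal Python sketch of the behaviour as described above: each range is the “all is fine” zone, and only a value outside of it raises the matching state. This is not the plugin’s actual code. The function names and the CRITICAL range 1000:10000 are made up for illustration; only the WARNING range 1900:8000 comes from the service definition.

```python
def parse_range(spec):
    """Split a 'low:high' string into two integers."""
    low, high = spec.split(":")
    return int(low), int(high)


def range_state(value, warn_spec, crit_spec):
    """Return OK, WARNING or CRITICAL for value.

    Assumption: a value inside a range does NOT alert, a value outside does.
    """
    crit_low, crit_high = parse_range(crit_spec)
    warn_low, warn_high = parse_range(warn_spec)
    if not crit_low <= value <= crit_high:
        return "CRITICAL"
    if not warn_low <= value <= warn_high:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # 1900:8000 is the WARNING range from the service definition;
    # 1000:10000 is a hypothetical CRITICAL range for illustration only.
    for rpm in (5000, 1500, 500):
        print(rpm, range_state(rpm, "1900:8000", "1000:10000"))
```

Compared to the usual “alert above the threshold” thinking, the ranges here describe where the fan speed is allowed to be, not where it starts to hurt.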

All in all, another lesson learned 😮

Nagios: NSClient++ in a clustered environment

February 26, 2009 (updated June 21, 2013) by Christian

Well, most of you already know that I’m a Nagios fanatic. I like to monitor as many aspects as I possibly can. So, yesterday I started figuring out ways to monitor our different cluster groups (housing a bunch of file shares; try above 20,000).

Now, my first tries failed horribly. I brought down a complete cluster group, resulting in a major annoyance. So today I went at it a bit smarter 😛 I cloned myself two VMs off my Windows Server 2003 Enterprise R2 template and created a new cluster.

After that, I tried it again on the test cluster, with the same result. The resource is created successfully, but once I try to bring it online, it breaks and moves the whole cluster group to the other node (as in: moving back and forth between the cluster nodes without end).

After that, I figured something had to be wrong with the command I was trying to use, the one given in the NSClient++ wiki. I then tried the command on the command line, and as soon as I hit <TAB> (oooold bash habit 😛 ), it completed the path but put quotes around it … Don’t ask me.

If I try the path without the quotes, no joy at all. Once you put quotes around it, everything becomes hunky-dory and the resource comes online without the slightest trouble!

Hint to self: When creating an NSClient++ cluster resource (or any application resource using a command that needs switches, for that matter), use a quoted command line along the lines of this:

"Q:_nsclientnsclient.exe" /test

VMware: New VirtualCenter 2.5 Update 4

February 25, 2009 (updated June 21, 2013) by Christian

As many people on the VM-Planet have already blogged about this, I’m not gonna just repeat the announcement. Let’s turn the clock back a while, to January 2008.

As the institution I work for is part of the DFN, we took the opportunity to join the “I want you to run our RA” gang. In January 2008 we thought about changing the vCenter certificate. Now, apparently there’s a slight difference between what the DFN-PCA requires and what VMware considers common practice.

The DFN-PCA states that only CSRs with a key length of 2048 bits are allowed (as outlined in 6.1.5 of the DFN-PKI Certificate Policy). VMware apparently didn’t think customers would actually use this “feature” (that is, changing the SSL certificates). Which brings us to the Update 4 release notes:

Customization Specifications Created in Previous Releases Can Be Used in VirtualCenter 2.5 Update 4 to Clone or Deploy Virtual Machine with Customized Guest Operating Systems
This release resolves an issue where, if you clone or deploy a virtual machine using a customization specification that was created prior to upgrading the VirtualCenter, the VirtualCenter Server might display the error message “The VirtualCenter server is unable to decrypt the passwords stored in the customization specification” in the following scenarios:

VirtualCenter Server is uninstalled first, and then re-installed and/or upgraded afterwards.

Custom SSL certificate are deployed, but the instruction in http://www.vmware.com/pdf/vi_vcserver_certificates.pdf are not followed in a verbatim manner.

Well, and apparently it ain’t fixed yet. At least not for us 😕

MySQL: Beware of sync_binlog on EXT3

February 23, 2009 (updated June 21, 2013) by Christian

Well, I just glanced over the my.cnf for our web cluster again, because I moved a database from one cluster to another and am getting quite different performance from it. And, as I expected, there is a slight difference between the two configuration files:

@@ -55,8 +58,6 @@
 innodb_log_group_home_dir       = /var/lib/mysql/db
 innodb_log_file_size            = 512M
 innodb_thread_concurrency       = 8
-sync_binlog                     = 1

And apparently, according to the MySQL Performance Blog, sync_binlog = 1 on EXT3 is really, really bad for performance (on top of that, we’re currently running without write caching, as the battery module of the storage is dead).
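
Spotting such a difference by eye gets tedious, so here is a small Python sketch that compares the [mysqld] section of two my.cnf-style configurations. It is only an illustration: the helper names and the "<absent>" marker are made up, the two sample configurations are trimmed down to the options visible in the diff, and a real my.cnf with !include directives would need extra handling that is not covered here.

```python
import configparser

ABSENT = "<absent>"  # hypothetical marker for "option not set in this file"


def load_options(text, section="mysqld"):
    """Parse my.cnf-style text and return one section as a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf allows bare flags without a value
        strict=False,         # tolerate repeated options
        interpolation=None,   # treat % literally
    )
    parser.read_string(text)
    return dict(parser[section])


def diff_options(old, new):
    """Return {option: (old_value, new_value)} for every option that differs."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        before = old.get(key, ABSENT)
        after = new.get(key, ABSENT)
        if before != after:
            changes[key] = (before, after)
    return changes


OLD_CNF = """
[mysqld]
innodb_log_file_size = 512M
innodb_thread_concurrency = 8
sync_binlog = 1
"""

NEW_CNF = """
[mysqld]
innodb_log_file_size = 512M
innodb_thread_concurrency = 8
"""

if __name__ == "__main__":
    changes = diff_options(load_options(OLD_CNF), load_options(NEW_CNF))
    for option, (before, after) in changes.items():
        print(f"{option}: {before} -> {after}")
```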

Tivoli Storage Manager Client and Microsoft Cluster Services (continued)

February 16, 2009 (updated August 8, 2014) by Christian

As you might recall from my first article about this topic, I had some trouble with the Microsoft Cluster Services and the registry replication. Today, as we tried switching the TSM server for some resources, we ran into this again.

We used both the service install tool (dsmcutil install scheduler) and the GUI to set the new password. When we brought the resource online with the local service manager, everything was hunky-dory. But as soon as we brought it online using the Cluster Manager, it failed horribly. Why?

Well, thinking back to the last time I read the Microsoft KB article, I remembered something about the replication:

When the resource goes online, the registry keys are updated with the previously checkpointed information.

When the resource is brought offline, all the checkpoints associated with this resource are saved.

If you manually update these registry keys while the application or service is offline, the changes may not be replicated or may be lost. To prevent this from happening, make any manual changes while the service or application resource is online.

Simply put, when you take the resource offline, the cluster saves the registry keys from the node currently running it onto the quorum (the checkpoints). Since we changed those settings while the resource was offline, the cluster discarded them when we brought the resource back online with the Cluster Manager.

Simple solution: remove the registry replication parameter while the resource is offline (and click “Apply” and “OK” afterwards). After that, update the registry on the cluster node currently owning the physical disk (either using the GUI or dsmcutil). Afterwards, re-add the registry replication key and you should be able to “force” the Microsoft Cluster into thinking that the registry on this cluster node is the valid one.

MySQL: Replication and hostname wild cards

February 15, 2009 (updated June 21, 2013) by Christian

Yeah, yeah .. I know, it’s the weekend. But I usually think much better when no one is rattling my cage. So I had another look at my replication problems.

Never, ever change InnoDB settings when migrating between hardware, because InnoDB is rather sensitive regarding those parameters.
When you’re setting up the replication (don’t ask me why) and copying the database over to the second replication partner, be aware that if you’re using wild cards, you’re gonna get seriously bitten in the back.

Now, let’s look at the setup.

As you can see in the diagram above (hah, sometimes Visio is rather useful 😛 ), we have two MySQL nodes, each serving as master (as in, we’re doing “normal” master-master replication).

Here’s what we’re gonna do first:

Set up the user mysql_repl for mysql%.home.barfoo.org, granting REPLICATION SLAVE.
Set up the user mysql_slave for mysql1.home.barfoo.org and mysql2.home.barfoo.org, also granting REPLICATION SLAVE.
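
For reference, the two steps above boil down to three GRANT statements. The following Python sketch just prints them so they can be reviewed before being pasted into a mysql session. The helper name is made up, and passwords are deliberately left out, so treat the output as a starting point rather than the exact statements used here.

```python
def grant_replication(user, host):
    """Build a GRANT statement giving user@host the REPLICATION SLAVE privilege."""
    # REPLICATION SLAVE is a global privilege, hence ON *.*
    return f"GRANT REPLICATION SLAVE ON *.* TO '{user}'@'{host}';"


# User and host names as listed in the steps above.
ACCOUNTS = [
    ("mysql_repl", "mysql%.home.barfoo.org"),
    ("mysql_slave", "mysql1.home.barfoo.org"),
    ("mysql_slave", "mysql2.home.barfoo.org"),
]

if __name__ == "__main__":
    for user, host in ACCOUNTS:
        print(grant_replication(user, host))
```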

Afterwards, we’re gonna copy the mysql database (either via tar and scp, or just via rsync over ssh) to both nodes.

Nagios: Integrating Cisco switches

February 13, 2009 (updated June 21, 2013) by Christian

Well, as I wrote recently, we received a new BladeCenter a few weeks back. Now, as we slowly take it into service, I was interested in monitoring the utilization of the backplanes as well as the CPU utilization of the Cisco Catalyst 3012 network switches.

The first mistake I made was to trust Cisco’s guide on how to get the utilization from the device using SNMP. They listed some OIDs, which I tried with snmpwalk and got a result from.

snmpwalk -v1 -c public -O n 10.0.0.35 .1.3.6.1.4.1.9.5.1.1.8
.1.3.6.1.4.1.9.5.1.1.8.0 = INTEGER: 0

Now, as I tried retrieving the SNMP data by means of the check_snmp plugin, I got some flaky results:

/usr/lib/nagios/plugins/check_snmp -H 10.0.0.35 -C public .1.3.6.1.4.1.9.5.1.1.8
SNMP problem - No data received from host
CMD: /usr/bin/snmpget -t 1 -r 5 -m '' -v 1 [authpriv] 10.0.0.35:161

Those of you who read the excerpts carefully will notice the difference between the OID snmpwalk returned and the one I passed on to check_snmp: the trailing .0.

The point being, the OIDs Cisco gave in their Design tech notes are either old or just not accurate at all. After appending the .0 to each OID given by Cisco, check_snmp is all hunky-dory and integrated into Nagios.
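
The background: snmpwalk walks the whole subtree below the OID it is given, so it finds the .0 instance on its own, while snmpget (which check_snmp calls, as the CMD line shows) needs the exact instance. Scalar objects always live at instance .0. A tiny Python helper along these lines can fix up a list of OIDs copied from a tech note. The function name is made up, and the check is deliberately naive: it only looks at the last component, so it is only meant for scalar objects like the one above, not for table columns.

```python
def scalar_instance(oid):
    """Append the .0 instance suffix that snmpget needs for a scalar object."""
    oid = oid.rstrip(".")
    # Naive check: assumes an OID already ending in .0 is a complete instance.
    return oid if oid.endswith(".0") else oid + ".0"


if __name__ == "__main__":
    # The OID from the snmpwalk example above.
    print(scalar_instance(".1.3.6.1.4.1.9.5.1.1.8"))
```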

As usual, the Nagios definitions are further down, for those interested.

Linux: Getting information about an EXT3 filesystem

February 13, 2009 (updated June 21, 2013) by Christian

You know, I’m not getting any younger. It’s getting harder remembering every damn command … so here is how you get information out of your EXT3 filesystem:

sles10sp2 ~ [0] > tune2fs -l /dev/sda2 | grep "^Filesystem"
Filesystem volume name:   <none>
Filesystem UUID:          8eec8235-4d9e-4b58-acf9-3c68c977d5ea
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal resize_inode filetype needs_recovery sparse_super large_file
Filesystem state:         clean
Filesystem OS type:       Linux
Filesystem created:       Tue May 27 10:48:56 2008
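
If you want to use those values in a script instead of just reading them, the “Key: value” layout is easy to pick apart. Here is a minimal Python sketch that parses such output into a dictionary. The function name is made up, and the sample text is a shortened copy of the output above, so nothing here calls tune2fs itself.

```python
def parse_tune2fs(text):
    """Turn 'Key:   value' lines into a dict, skipping lines without a colon."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


SAMPLE = """\
Filesystem volume name:   <none>
Filesystem UUID:          8eec8235-4d9e-4b58-acf9-3c68c977d5ea
Filesystem state:         clean
Filesystem created:       Tue May 27 10:48:56 2008
"""

if __name__ == "__main__":
    info = parse_tune2fs(SAMPLE)
    print(info["Filesystem UUID"])
    print(info["Filesystem created"])
```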

Restarting the NSClient++ service without the management applet

February 11, 2009 (updated June 21, 2013) by Christian

For people who are as point-and-click lazy as me, here is how you restart the service without using the service management applet.

net stop "NSClientpp (Nagios) 0.3.5.2 2008-09-24 w32"
net start "NSClientpp (Nagios) 0.3.5.2 2008-09-24 w32"

MySQL: Setting up an InnoDB raw device

February 11, 2009 (updated June 21, 2013) by Christian

Well, since I had to brood about this (again I might add), I’m gonna write it down this time …

Setting up the InnoDB raw device isn’t that hard; just make sure the device has proper permissions (either add mysql to the disk group or create a udev rule).

KERNEL=="sdb2", OWNER="mysql", GROUP="mysql"

Now after that (and a reboot/udevcontrol reload_rules later), you should be able to initialize the InnoDB device. Yes, the InnoDB device needs initializing.

When you create a new data file, you must put the keyword newraw immediately after the data file size in innodb_data_file_path.

The next time you start the server, InnoDB notices the newraw keyword and initializes the new partition.

After that is done, you should be able to start the MySQL service for the first time. It is gonna fail (at least according to the init script), but if you take a closer look at /var/log/mysqld.log, it ultimately succeeds.

14:11:07  InnoDB: Setting file /dev/sdb2 size to 50200 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Progress in MB: 100 200 300 400 500 600 700 800 900 1000 ... 50200
14:21:26  InnoDB: Log file ib_logfile0 did not exist: new to be created
InnoDB: Setting log file /var/lib/mysql/db/ib_logfile0 size to 512 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Progress in MB: 100 200 300 400 500
14:21:34  InnoDB: Log file ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ib_logfile1 size to 512 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Progress in MB: 100 200 300 400 500
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
14:21:42  InnoDB: Started; log sequence number 0 0

After that, change “newraw” to “raw” in your /etc/my.cnf. Otherwise, MySQL is gonna reinitialize the volume all over again, as the handbook states:

However, do not create or change any InnoDB tables yet. Otherwise, when you next restart the server, InnoDB reinitializes the partition and your changes are lost.

After InnoDB has initialized the new partition, stop the server, change newraw in the data file specification to raw.
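
Since forgetting this step wipes the partition on the next restart, it may be worth scripting it. Below is a minimal Python sketch that rewrites the keyword in a single configuration line. The function name is made up, and the example value /dev/sdb2:50200Mnewraw is an assumption pieced together from the log output above (device and size); the actual line from my.cnf is not shown in this post.

```python
import re


def newraw_to_raw(line):
    """Replace the newraw keyword with raw in an innodb_data_file_path line."""
    # Leave every other line of the configuration untouched.
    if not line.lstrip().startswith("innodb_data_file_path"):
        return line
    return re.sub(r"newraw\b", "raw", line)


if __name__ == "__main__":
    # Hypothetical setting, reconstructed from the log output above.
    print(newraw_to_raw("innodb_data_file_path = /dev/sdb2:50200Mnewraw"))
```

Run it over the file only after the server has been stopped, in line with the handbook excerpt above.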