Nagios: check_snmp again

Well, today I had to rack my brain again over the way check_snmp handles WARNING and CRITICAL events. From my point of view, check_snmp is just plain awkward sometimes.

As you know, all the other plugins accept WARNING and CRITICAL thresholds as simple limits: if the returned integer is above the threshold, the check goes into the WARNING/CRITICAL state. But check_snmp doesn’t play that way.

It only expects ranges, and those are the ranges that are NOT gonna result in WARNING or CRITICAL events. Which is kinda stupid, since you gotta think twice about the thresholds 😛
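To make the "range means OK" logic a bit more tangible, here is a minimal Python sketch of the range notation as described in the Nagios plugin development guidelines (alert when the value is outside the range, a leading @ flips that). The function names are made up for this example, and how a particular check_snmp release interprets -w and -c is an assumption you should verify against its --help output.

```python
import math


def parse_range(spec):
    """Parse a Nagios-style range string into (start, end, alert_inside)."""
    alert_inside = spec.startswith("@")  # "@" flips the logic: alert INSIDE the range
    if alert_inside:
        spec = spec[1:]
    if ":" in spec:
        start_s, end_s = spec.split(":", 1)
    else:
        start_s, end_s = "0", spec  # a bare "10" means the range 0..10
    if start_s == "~":
        start = -math.inf  # "~" stands for negative infinity
    elif start_s == "":
        start = 0.0
    else:
        start = float(start_s)
    end = math.inf if end_s == "" else float(end_s)
    return start, end, alert_inside


def alerts(value, spec):
    """Return True if value should raise an alert for the given range."""
    start, end, alert_inside = parse_range(spec)
    in_range = start <= value <= end
    return in_range if alert_inside else not in_range


if __name__ == "__main__":
    # "0:90" is the OK range, so 95 alerts and 50 does not.
    print(alerts(95, "0:90"), alerts(50, "0:90"))
```

So with this notation, "warn above 90" is written as the OK range 0:90 rather than as the single threshold 90.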

All in all, another lesson learned 😮

Nagios: NSclient++ in a clustered Environment

Well, most of you already know that I’m a Nagios fanatic. I like to monitor as many aspects as I possibly can. So, yesterday I started figuring out ways to monitor our different cluster groups (housing a bunch of file shares, try more than 20,000).

Now, my first tries failed horribly. I brought down a complete cluster group, resulting in a major annoyance. So today I went at it a bit smarter 😛 I cloned two VMs off my Windows Server 2003 Enterprise R2 template and created a new cluster.

After that, I tried it on the test cluster again, with the same result. The resource is successfully created, but once I try to bring it online, it breaks and moves the whole cluster group to the other node (and the group then keeps moving back and forth between the cluster nodes with no end).

After that, I figured something had to be wrong with the command I was trying to use, the one given in the NSClient++ wiki. I then tried the command on the command line, and as soon as I hit <TAB> (oooold bash habit 😛), it completed the path but put quotes around it … Don’t ask me.

If I try the path without the quotes, no joy at all. Once you put quotes around it, everything becomes hunky-dory and the resource comes online without the slightest trouble!

Hint to self: When creating an NSClient++ cluster resource (or any application resource using a command that needs switches, for that matter), use a quoted command line along the lines of this:
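A minimal sketch of the quoting, assuming a hypothetical install path and a placeholder switch (neither is taken from the NSClient++ wiki). It uses Python's subprocess.list2cmdline, which applies the usual Windows quoting rules, so an executable path containing spaces ends up wrapped in double quotes.

```python
import subprocess


def cluster_command_line(executable, switches):
    """Build a Windows command line; arguments containing spaces get quoted."""
    return subprocess.list2cmdline([executable, *switches])


# Hypothetical values for illustration only: adjust the path to your install
# and replace the placeholder switch with the one your setup really needs.
EXE = r"C:\Program Files\NSClient++\NSClient++.exe"
SWITCHES = ["/example-switch"]

if __name__ == "__main__":
    print(cluster_command_line(EXE, SWITCHES))
    # prints: "C:\Program Files\NSClient++\NSClient++.exe" /example-switch
```

The point is only that the path and the switches are separate pieces, and the path needs its own pair of quotes.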

VMware: New VirtualCenter 2.5 Update 4

As many people on the VM-Planet have already blogged about this, I’m not gonna write just about the release itself. Let’s turn the clock back a few months, to January 2008.

As the institution I work for is part of the DFN, we took the opportunity to be a part of the “I want you to run our RA” gang. In January 2008 we thought about changing the vCenter certificate. Now, apparently there’s a slight difference between what the DFN-PCA requires and what VMware considers common practice.

The DFN-PCA states that only CSRs with a key length of 2048 bits are allowed (as outlined in 6.1.5 of the DFN-PKI Certificate Policy). Now, VMware apparently didn’t really think customers would use this “feature” (that is, changing the SSL certificates).

Customization Specifications Created in Previous Releases Can Be Used in VirtualCenter 2.5 Update 4 to Clone or Deploy Virtual Machine with Customized Guest Operating Systems
This release resolves an issue where, if you clone or deploy a virtual machine using a customization specification that was created prior to upgrading the VirtualCenter, the VirtualCenter Server might display the error message The VirtualCenter server is unable to decrypt the passwords stored in the customization specification in the following scenarios:

  1. VirtualCenter Server is uninstalled first, and then re-installed and/or upgraded afterwards.
  2. Custom SSL certificate are deployed, but the instruction in http://www.vmware.com/pdf/vi_vcserver_certificates.pdf are not followed in a verbatim manner.

Well, and apparently it ain’t fixed yet. At least not for us 😕

MySQL: Beware of sync_binlog on EXT3

Well, I just glanced over the my.cnf for our web-cluster again, because I just moved a database from one cluster to another and am getting quite different performance from it. So, as I expected, there is a slight difference between both configuration files:

And apparently, according to the MySQL Performance Blog, that’s really, really bad (on top of that, we’re currently running without write caching, as the battery module of the storage is dead).
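For spotting this kind of difference without eyeballing two files, here is a minimal Python sketch that diffs the [mysqld] sections of two my.cnf texts. The option values in the two sample configs are made up for illustration; they are not the values from our clusters. It also assumes the files are simple enough for configparser (no !include directives).

```python
import configparser


def mysqld_options(text):
    """Return the options of the [mysqld] section as a dict."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf allows bare flags such as skip-networking
        interpolation=None,   # treat % literally
        strict=False,         # tolerate repeated options
    )
    parser.read_string(text)
    return dict(parser["mysqld"]) if parser.has_section("mysqld") else {}


def diff_options(a_text, b_text):
    """Map each differing option to its (value_in_a, value_in_b) pair."""
    a, b = mysqld_options(a_text), mysqld_options(b_text)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Made-up sample configs, only to show the shape of the result.
CLUSTER_A = """
[mysqld]
sync_binlog = 0
innodb_flush_log_at_trx_commit = 1
"""
CLUSTER_B = """
[mysqld]
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
"""

if __name__ == "__main__":
    print(diff_options(CLUSTER_A, CLUSTER_B))
    # prints: {'sync_binlog': ('0', '1')}
```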

Tivoli Storage Manager Client and Microsoft Cluster Services (continued)

As you might recall from my first article about this topic, I had some trouble with the Microsoft Cluster Services and the registry replication. Now, today as we tried switching the TSM-Server for some resources, we ran into this again.

We were using the service install tool (dsmcutil install scheduler) as well as the GUI to set the new password. Now, as we brought the resource online with the local service manager, everything was hunky-dory. But as soon as we brought it online using the Cluster Manager, it failed horribly. Why?

Well, when I read the Microsoft KB article the last time, I remembered something about the replication:

  • When the resource goes online, the registry keys are updated with the previously checkpointed information.
  • When the resource is brought offline, all the checkpoints associated with this resource are saved.

If you manually update these registry keys while the application or service is offline, the changes may not be replicated or may be lost. To prevent this from happening, make any manual changes while the service or application resource is online.

Simply put, when you toggle the resource offline, the cluster saves the registry from the currently running node onto the quorum (checkpoints). As we changed those settings while the resource was offline, the cluster discarded them as soon as we toggled it back online with the Cluster Manager, overwriting them with the checkpointed values.

Simple solution: just remove the registry replication parameter while the resource is offline (and click “Apply” and “OK” afterwards). After that, update the registry on the cluster node currently owning the physical disk drive (either using the GUI or dsmcutil). Afterwards, re-add the registry replication key, and you should be able to “force” the Microsoft Cluster into thinking that the registry you have on this cluster node is the valid one.

MySQL: Replication and hostname wild cards

Yeah, yeah .. I know, it’s the weekend. But I usually can think much better when no one is rattling my cage. So I had another look at my replication problems.

  1. Don’t you ever change InnoDB settings when migrating between hardware, because InnoDB is rather sensitive regarding those parameters.
  2. When you’re setting up the replication (don’t ask me why) and copying over the database to the second replication partner, be aware that if you’re using wild cards, you’re gonna get seriously bitten.

Now, let’s look at the setup.

[Figure: mysql-nodes]

As you can see in the diagram above (hah, sometimes Visio is rather useful 😛), we have two MySQL nodes, each serving as master (as in we’re doing “normal” master-master replication).

Here’s what we’re gonna do first:

  1. Set up the user mysql_repl for mysql%.home.barfoo.org, granting REPLICATION SLAVE.
  2. Set up the user mysql_slave for mysql1.home.barfoo.org and mysql2.home.barfoo.org, also granting REPLICATION SLAVE.

Afterwards, we’re gonna copy the mysql database (either via tar and scp, or just via rsync over ssh) to both nodes.
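To see what the wild card in the first account actually covers, here is a minimal Python sketch that mimics the % and _ wild cards in MySQL host names. It is only an approximation for illustration: the real server also orders accounts by how specific the host is, which this sketch ignores, and the third test host name is made up.

```python
import re


def host_pattern_to_regex(pattern):
    """Translate a MySQL host pattern into a case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # % matches any run of characters
        elif ch == "_":
            parts.append(".")   # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matching_accounts(host, accounts):
    """Return the (user, host_pattern) pairs whose pattern covers host."""
    return [
        (user, pattern)
        for user, pattern in accounts
        if host_pattern_to_regex(pattern).fullmatch(host)
    ]


# The accounts from the two steps above.
ACCOUNTS = [
    ("mysql_repl", "mysql%.home.barfoo.org"),
    ("mysql_slave", "mysql1.home.barfoo.org"),
    ("mysql_slave", "mysql2.home.barfoo.org"),
]

if __name__ == "__main__":
    for host in ("mysql1.home.barfoo.org", "mysql-backup.home.barfoo.org"):
        print(host, matching_accounts(host, ACCOUNTS))
```

Note that the wild-card account matches both real nodes and any other host whose name happens to start with mysql in that domain.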

Nagios: Integrating Cisco switches

Well, as I wrote recently, we received a new BladeCenter a few weeks back. Now, as we slowly take it into service, I was interested in monitoring the utilization of the back planes as well as the CPU utilization of the Cisco Catalyst 3012 network switches.

The first mistake I made was to trust Cisco with their guide about how to get the utilization from the device using SNMP. They stated some OIDs, which I tried with snmpwalk and got a result from.

Now, as I tried retrieving the SNMP data by means of the check_snmp plugin, I got some flaky results:

Those of you who read the excerpts carefully will notice the difference between the OID snmpwalk returned and the OID I passed on to check_snmp.

The point being, the OIDs Cisco gave in their Design tech notes are either old, or just not accurate at all. After appending .0 to each OID given by Cisco, check_snmp is all hunky-dory and integrated into Nagios.
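The reason, as far as I understand it: snmpwalk walks down from the OID you give it and happily returns the instance below it, while a plain GET needs the exact instance, which for a scalar value ends in .0. A tiny Python sketch of that fix follows. The example OID is avgBusy5 from the OLD-CISCO-CPU-MIB as I remember it, used here as an assumption for illustration and not necessarily one of the OIDs from the tech note.

```python
def scalar_instance(oid):
    """Return the instance OID of a scalar object, appending '.0' if missing.

    Caveat: this blindly assumes the OID names a scalar; table columns need
    a real index instead of .0.
    """
    oid = oid.rstrip(".")
    return oid if oid.endswith(".0") else oid + ".0"


# Assumed example OID (avgBusy5), not taken from the original excerpts.
CPU_5MIN = "1.3.6.1.4.1.9.2.1.58"

if __name__ == "__main__":
    print(scalar_instance(CPU_5MIN))
    # prints: 1.3.6.1.4.1.9.2.1.58.0
```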

As usual, the Nagios definitions are further down, for those interested.

Linux: Getting information about an EXT3 filesystem

You know, I’m not getting any younger. It’s getting harder remembering every damn command … so here is how you get information out of your EXT3 filesystem:
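A minimal Python sketch, assuming the information is read with tune2fs -l (one common tool for listing the superblock; that this is the command meant here is an assumption). It turns the "Key: value" listing into a dict. The sample text and the device name are made up for illustration.

```python
import subprocess


def parse_superblock_listing(text):
    """Turn 'Key:   value' lines into a dict, skipping lines without a colon."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def ext3_info(device):
    """Run tune2fs -l on the device (usually needs root) and parse the output."""
    result = subprocess.run(
        ["tune2fs", "-l", device],
        check=True, capture_output=True, text=True,
    )
    return parse_superblock_listing(result.stdout)


# Made-up sample in the shape of a tune2fs -l listing.
SAMPLE = """\
Filesystem volume name:   <none>
Block count:              2621440
Block size:               4096
"""

if __name__ == "__main__":
    print(parse_superblock_listing(SAMPLE)["Block size"])
    # On a real box, something like: ext3_info("/dev/sda1")  (hypothetical device)
```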

Restarting the NSclient++ service without the management applet

For people who are as point-and-click-lazy as me, here is how you restart the service without using the service management applet.
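A minimal sketch using the Windows net stop / net start commands, wrapped in Python. The service name NSClientpp is an assumption; check the name your installation registered before relying on it.

```python
import subprocess


def restart_commands(service="NSClientpp"):
    """Return the stop/start command lines for a Windows service."""
    # The default service name is an assumption, see the note above.
    return [["net", "stop", service], ["net", "start", service]]


def restart(service="NSClientpp"):
    """Stop and start the service; raises CalledProcessError on failure."""
    for command in restart_commands(service):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call restart() on the Windows box itself.
    for command in restart_commands():
        print(" ".join(command))
```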

MySQL: Setting up an InnoDB raw device

Well, since I had to brood about this (again I might add), I’m gonna write it down this time …

Setting up the InnoDB raw device isn’t that hard, just make sure the device has proper permissions (either add mysql to the disk group or create a udev rule).

Now after that (and a reboot/udevcontrol reload_rules later), you should be able to initialize the InnoDB device. Yes, the InnoDB device needs initializing.

When you create a new data file, you must put the keyword newraw immediately after the data file size in innodb_data_file_path.

The next time you start the server, InnoDB notices the newraw keyword and initializes the new partition.

After that is done, you should be able to start the MySQL service for the first time. It is gonna fail (at least according to the init-script), but ultimately, if you take a closer look at /var/log/mysqld.log, it’s gonna be successful.

After that, remove the “newraw” from your /etc/my.cnf. Otherwise, MySQL is gonna reinitialize the volume all over again, as the handbook states.

However, do not create or change any InnoDB tables yet. Otherwise, when you next restart the server, InnoDB reinitializes the partition and your changes are lost.

After InnoDB has initialized the new partition, stop the server, change newraw in the data file specification to raw.
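The newraw-to-raw step is easy to forget, so here is a minimal Python sketch that rewrites only the innodb_data_file_path line of a my.cnf text. The device names and sizes in the sample are hypothetical, and the sketch works on a string; reading and writing /etc/my.cnf (with the server stopped) is left to you.

```python
import re

# Matches the whole innodb_data_file_path line, nothing else.
DATA_FILE_PATH_LINE = re.compile(r"(?m)^[ \t]*innodb_data_file_path[ \t]*=.*$")


def finish_raw_init(my_cnf_text):
    """Replace 'newraw' with 'raw' on the innodb_data_file_path line only."""
    def fix_line(match):
        return match.group(0).replace("newraw", "raw")
    return DATA_FILE_PATH_LINE.sub(fix_line, my_cnf_text)


# Hypothetical devices and sizes, for illustration only.
BEFORE = """[mysqld]
# remember to switch newraw to raw after the first start
innodb_data_home_dir =
innodb_data_file_path = /dev/sdb1:3Gnewraw;/dev/sdc1:2Gnewraw
"""

if __name__ == "__main__":
    print(finish_raw_init(BEFORE))
```

After the change the line reads /dev/sdb1:3Graw;/dev/sdc1:2Graw, and the comment mentioning newraw is left alone.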