MySQL: Beware of sync_binlog on EXT3

Well, I just glazed again over my my.cnf for our web-cluster because I just moved a database from one cluster to another and getting quite different performance from it. So, as I expected, there is a slight difference between both configuration files:

And apparently, according to the MySQL Performance Blog that’s really, really bad (as well, we’re currently running without write caching, as the battery module of the storage is dead).

MySQL: Replication and hostname wild cards

Yeah, yeah .. I know, it’s weekend. But I usually can think much better when no one is rattling my cage. So I had another look at my replication problems.

  1. Don’t you never ever change InnoDB settings when migrating between hardware,
    because InnoDB is rather sensitive regarding those parameters.
  2. When you’re setting up the replication (don’t ask me why) and copying over the database to the second replication partner, be aware if you’re using wild cards you’re gonna get seriously bitten in the back.

Now, let’s look at the constellation.

mysql-nodes

As you can see on the graph above (hah, sometimes Visio is rather useful 😛 ), we do have two MySQL nodes, each serving as master (as in we’re doing “normal” master-master replication).

Here’s what we’re gonna do first:

  1. Setup the user mysql_repl for mysql%.home.barfoo.org, granting REPLICATION SLAVE.
  2. Setup the user mysql_slave for mysql1.home.barfoo.org and mysql2.home.barfoo.org, also granting REPLICATION SLAVE.

Afterwards, we’re gonna copy the mysql database (either via tar and scp, or just via rssh — which is rsync via ssh) to both nodes. Read More

MySQL: Setting up an InnoDB raw device

Well, since I had to brood about this (again I might add), I’m gonna write it down this time …

Setting up the InnoDB raw device isn’t that hard, just make sure the device has proper permissions (either add mysql to the disk group or create a udev rule).

Now after that (and a reboot/udevcontrol reload_rules later), you should be able to initialize the InnoDB device. Yes, the InnoDB device needs initializing.

When you create a new data file, you must put the keyword newraw immediately after the data file size in innodb_data_file_path.

The next time you start the server, InnoDB notices the newraw keyword and initializes the new partition.

After that is done, you should be able to start the MySQL service for the first time. It is gonna fail (at least according to the init-script), but ultimatly if you take a closer look at /var/log/mysqld.log it’s gonna be successful.

After that, remove the “newraw” from your /etc/my.cnf. Otherwise, MySQL is gonna reinitialize the volume all over again, as the handbook states.

However, do not create or change any InnoDB tables yet. Otherwise, when you next restart the server, InnoDB reinitializes the partition and your changes are lost.

After InnoDB has initialized the new partition, stop the server, change newraw in the data file specification to raw.

Defragmenting all fragmented MyISAM tables

I just had another look at what I wrote the week before last (you know, being home-sick/on vacation has it’s advantages) and additionally read up on “OPTIMIZE TABLE” again. The comments in the manual mention “SHOW TABLE STATUS“, which gives you a complete list, but it doesn’t allow you to filter certain kinds of things out (like I only wanted to see MyISAM tables in the list, I only wanted database and table).

So I went ahead and looked around in MySQL’s own databases and if you look closely at information_schema, it’s got a list of all databases/tables with an additional pointer whether or not databases are fragmented, the row Data_free. I only found this, because I looked at how the mysqltuner figured whether or not you have fragmented tables.

So, without further ado, here’s the final script I’m gonna torture for the next week:

I know it ain’t completely bullet proof and it sure as hell isn’t neat, but I think it does the job. Also, if you don’t want to paste it, here’s the file download.

Been a while

Well, it’s been quite a while since most of the people last heard a word from me. The last few months I’ve been extremely busy with work-related tasks (and as a side-effect of that, didn’t want to spend much time in front of the computer after 9 hours of work). I also started spending more and more time in the gym, like nearly two hours every Tuesday and Thursday.

  • I finally fixed our replication issues, we do now have a working! MySQL Multi-Master (1. Node, 2. Node — bear in mind, this boxes are *only* serving MySQL and nothing else, so don’t use these configurations on mixed setups) Replication Setup as database back end for our TYPO3-vHosts.
  • all the web nodes are now serving the content from a clustered, shared SAN volume (is that a good thing ? 😛 – don’t know yet …)
  • our VI environment is getting more and more acceptance (even if you hear some complaints now and then, like “awww, damn that crap my 4GiB RAM, 2×3.0GHz Windows 2008 is running soooo choppy” – simple answer, don’t use Windows Server 2008 and/or Windows Vista!)
  • I finished prepping our VM templates (at least the Windows ones)
  • we’re still putting together the plans on whether or not invest into a VDI solution.

The next few weeks are gonna be as frantic as the weeks before, I still have to migrate a lot of TYPO3 installations to our new cluster (which sadly needs time, as we need to wait for DNS changes to propagate). Honestly, I might be ending up extending the SAN volume for the MySQL data storage, as even with only three somewhat busy sites, the binary log of the last 5 days is about 2GiB in size. And we still have ~20 other busy sites on a separate box.

Lucky me, I created the MySQL data storage on a logical volume, so I can easily extend the volume in the san-manager semi-online (the fs needs to be unmounted and thus the MySQL process), then extend the physical volume (LVM2 PV) and the logical volume (LV) afterwards, and at last the underlying EXT3 file system.

As some of you know by now, I am on extended leave for now. I don’t have tree access (at my own request), though I’m gonna try to keep up with Chris and 2008.0 … So long!

TYPO3 and MySQL replication

Apparently the TYPO3 version we are using, doesn’t play too nice with the MySQL MasterMaster replication.

Sometimes, something like this is going to happen:

Well, as you can see from the last line in the log, the Slave-SQL thread found a duplicate entry and thought it is smart to just turn off the thread instead of disregarding the just made entry. So from now on, both databases drift since there ain’t no replication anymore until someone kick starts the replication again (someone being me).

Anyway, I think I finally traced the fucker down, supposedly one of the problematic cases is located in t3lib/class.t3lib_tstemplate.php on line 362.

Basically what TYPO3 is doing is a DELETE and an INSERT right afterwards. But apparently, it doesn’t check whether the DELETE even succeeded. I hacked it for now, simply adding this:

Sadly, this looks more and more like a race-condition between the two boxes (as in the replication / UPDATE being too slow), when users visit a edited site, that hasn’t had it’s cache regenerated yet. Problem is, it ain’t just this single spot, but also the search indexing, image cache and the whole page cache. For now we switched the cluster to active/passive load balancing, till we have a chance to see if a newer TYPO3 fixes those issues.

Continuing on SLES10

OK, it turns out that I was rather stupid when configuring the my.cnf. As it turned out, the effect I was seeing was due to the presence of two log-bin lines, which looked like the following:

And some lines down there was this:

Now the next thing I encountered was while importing our old databases (they are like 1.1GiB each, 25 databases total). The second MySQL Master (and his Slave) will choke as soon as you dump the data too fast into the first Master, as the binlog seems to be too big for MySQL to transfer it via TCP (smth like “Packet too large – try increasing max_packet_size” in the error-log; only problem was that max_packet_size was already at 1GiB which is the absolut maximum for MySQL 5.0 according to the handbook).

A way around this (thanks to a co-worker who pushed me towards this road) is disabling all the MySQL Master/Slave stuff in your my.cnf, start the mysql daemon as a simple, dumb database, import all your databases, stop the mysql daemon; tar up the whole BASEDIR and scp/rssh it to your second master.

Clean out the BASEDIR on the second master, untar your tarball, edit your my.cnf again to include the whole Master/Slave portions on both boxes and you should be up and running 😀

I haven’t run any tests on the MasterMaster replication yet, but I’ll do that as soon as I’m at work again (which is the 27th June, as I’m off for vacation since yesterday, yay!)

Back at SLES10

Here I am, sitting at my desk on a Thuesday evening thinking about what happened the last few days.

  1. I finally got to play around with our PacketPro 450 Cluster (nifty LoadBalancing appliance)
  2. We reworked the network the way *we* want it (and not that tool of a wannabe sysadmin)
  3. We mostly figured out how to do the LoadBalancing right, we just need to find some bugs in the LoadBalancer software (like the thing is failing over to its slave from time to time, but keeping the IP address for himself) or let the guys at teamix do their work and hopefully get a working release within the next week or so
  4. I figured out how to setup interface bonding with SLES10 (it was quite straight forward, thanks to the excellent in-kernel documentation), and we’re using an active-backup mode for now
  5. I still need to figure out how to do the MySQL Master<->Master replication right .. I’m currently building fresh RPM’s on one of those Dell blades (yes, they ROCK!) which will hopefully be finished till I’m at the office tomorrow.
  6. Pt. 5 also includes figuring out how to pass MySQL a custom location for the binary-log, at least that’s what the handbook says in Chapter “5.11.3. The Binary Log” …

When started with the –log-bin[=base_name] option, mysqld writes a log file containing all SQL commands that update data. If no base_name value is given, the default name is the name of the host machine followed by -bin. If the basename is given, but not as an absolute pathname, the server writes the file in the data directory. It is recommended that you specify a basename; see Section B.1.8.1, “Open Issues in MySQL”, for the reason.

That behavior works for –log-bin-index (like log-bin-index=/mysql/binlogs/$HOSTNAME.idx), but doesn’t for –log-bin. *shrug* I’ll see if that is fixed with something >5.0.18 (that’s what SLES10 currently ships).

I’m also looking for a network topology drawing program (possibly free), as Microsoft Visio (either 2003 or 2007, Standard or Professional) is nice, but still can’t draw shit correctly. So I stumbled upon yEd, which seems to look nice (I haven’t yet looked at it, but will tomorrow) that hopefully gives me the opportunity to draw/visualize my setup at work 😯

Waiting

We are still waiting for the money promised by the state and the country for our HBFG (again, it’s “Hochschulbauförderungsgesetz”), that hopefully is reducing or eliminating our storage/SAN problem we have currently. Right now we have to Cisco MDS9216 (that’s a 16-port 2GBps SAN-switch, two for redundancy), which means we only have 16 SAN-ports. That isn’t much, but still is to less, as we have like 30 machines or so, that *really* need access to the SAN, so we either end up unplugging some of them from the SAN or merge them onto some big machines (like our x366).

The other side of the problem is the storage .. Currently that isn’t redundant, which means we’re fucked if the storage decides to not come up, or one of the controller smokes .. So were looking at two DS4700 with 2 enclosures each filled with 300GB 2GBps FC disks. That will hopefully also solve our constant lack of rackspace.

Apart from that, we took a look at the terminal server market, heard someone from Citrix, looked ourselves at 2X (and I think we are going with the 2X solution – even if they don’t support the authentication passthrough – yet). We might want to consider buying dedicated hardware for the terminal servers, as I implemented them running on the ESX which isn’t a permanent solution, as at least the students will work on those terminal servers 0700-2200, that means continuous load in that time, which isn’t good for the ESX Cluster, as they are pretty loaded already.

We’re also looking in buying a third box for the ESX Cluster, probably one of the same as we have currently (that is x366 – with 2 DC Xeon’s, 16GB RAM, 2×73 GB SAS, 2x dual-port Intel NIC, 2x dual-port FC HBA) to get some extra capacity.

Recently I did some experiments with Gentoo as MySQL cluster (master< ->master replication for our upcoming database servers – that’s what the blade chassis and the two blades are for) and noticed that the Gentoo VM’s were sucking up RAM and didn’t release it, so I had to reset them every morning, in order to free some RAM. I guess I should poke Chris a bit about that, as he told me back at FOSDEM that he was doing some load testing with a similar setup not so far ago.