TYPO3 – BAFM

TYPO3 hogging

April 7, 2008 (updated June 21, 2013), by Christian

Well, we do appear to be having some strange load problems with our main TYPO3 box hosting several home pages of the local universities, as you can see below.

[Graph: load on t3node1 between 05:00 and 19:00 on 2008/04/07]

We repeatedly tried to work out which installation was responsible, but neither I nor the other Unix sysadmin knew a good way to measure the load each TYPO3 installation was causing (there ain’t no phptop or anything similar). But since the new semester started today, we figured it was time to finally find out which one it was. A few minutes of downtime (as in one or two) wouldn’t be much of a problem compared to what we’d gain from knowing.

As a comparison, here’s the “normal” load for the last week:

[Graph: load on t3node1 between 2008/03/31 and 2008/04/07]

So as a last resort (because of said load problems), we simply deactivated one vHost after another until the load started to relax. Unsurprisingly, it was one of the installations that had caused problems before. Let’s see whether the people over at said university are insightful … 😆
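A gentler way to get a rough per-vHost number, without switching sites off, would be to sum up the request time per vHost from the access logs. The following is a minimal sketch only, not what we actually ran: it assumes a custom Apache LogFormat that starts with %v (the vHost name) and ends with %D (request duration in microseconds), and the function name and sample lines are made up for illustration.

```python
from collections import defaultdict

def busy_seconds_per_vhost(lines):
    """Sum request durations per vHost.

    Assumes each line starts with the vHost name (%v) and ends with the
    request duration in microseconds (%D); all other fields are ignored.
    """
    totals = defaultdict(float)
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[-1].isdigit():
            continue  # skip blank or malformed lines
        totals[fields[0]] += int(fields[-1]) / 1_000_000
    # Busiest vHost first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sample lines, not taken from our logs
sample = [
    'uni-a.example.org 10.0.0.1 - - [07/Apr/2008:09:00:01 +0200] "GET / HTTP/1.1" 200 5120 2500000',
    'uni-b.example.org 10.0.0.2 - - [07/Apr/2008:09:00:02 +0200] "GET / HTTP/1.1" 200 4096 300000',
    'uni-a.example.org 10.0.0.3 - - [07/Apr/2008:09:00:03 +0200] "GET /news HTTP/1.1" 200 2048 1500000',
]
for vhost, seconds in busy_seconds_per_vhost(sample):
    print(f"{vhost}: {seconds:.1f}s")
```

Request time is wall-clock time, not CPU time, so this is only a hint at which installation keeps the box busy, not a real replacement for something like a phptop.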

OCFS2 follow-up

March 7, 2008 (updated June 21, 2013), by Christian

OK, it turned out that said colleague wasn’t responsible at all. The *real* trigger was me creating a new volume on our SAN, on the same array that houses the OCFS2 volume.

Apparently, during creation of an additional SAN volume, all other SAN volumes in this array are either read-only or delayed during that time, as you can see from the following log:

kernel: (13,3):o2hb_write_timeout:242 ERROR: Heartbeat write timeout to device sdd1 after 12000 milliseconds
kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 4):
kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 4)
kernel: Index 5: took 0 ms to do submit_bio for read
kernel: Index 6: took 0 ms to do waiting for read completion
kernel: Index 7: took 0 ms to do bio alloc write
kernel: Index 8: took 0 ms to do bio add page write
kernel: Index 9: took 0 ms to do submit_bio for write
kernel: Index 10: took 0 ms to do checking slots
kernel: Index 11: took 0 ms to do waiting for write completion
kernel: Index 12: took 2002 ms to do msleep
kernel: Index 13: took 0 ms to do allocating bios for read
kernel: Index 14: took 0 ms to do bio alloc read
kernel: Index 15: took 0 ms to do bio add page read
kernel: Index 16: took 0 ms to do submit_bio for read
kernel: Index 17: took 0 ms to do waiting for read completion
kernel: Index 18: took 0 ms to do bio alloc write
kernel: Index 19: took 0 ms to do bio add page write
kernel: Index 20: took 0 ms to do submit_bio for write
kernel: Index 21: took 0 ms to do checking slots
kernel: Index 22: took 0 ms to do waiting for write completion
kernel: Index 23: took 2004 ms to do msleep
kernel: Index 0: took 0 ms to do allocating bios for read
kernel: Index 1: took 0 ms to do bio alloc read
kernel: Index 2: took 0 ms to do bio add page read
kernel: Index 3: took 0 ms to do submit_bio for read
kernel: Index 4: took 9995 ms to do waiting for read completion
kernel: (13,3):o2hb_stop_all_regions:1682 ERROR: stopping heartbeat on all active regions.
kernel: Kernel panic - not syncing: *** ocfs2 is very sorry to be fencing this system by panicing ***
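To spot the culprit in a dump like this without squinting, the per-step timings can be pulled out with a few lines of Python. This is a rough sketch: the regex is only guessed from the log lines above, and the function name is made up.

```python
import re

# Pattern guessed from the o2hb lines above: "Index N: took M ms to do <step>"
STEP = re.compile(r"Index (\d+): took (\d+) ms to do (.+)$")

def slowest_step(log_lines):
    """Return (index, milliseconds, step) of the slowest heartbeat step, or None."""
    steps = []
    for line in log_lines:
        match = STEP.search(line.strip())
        if match:
            steps.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return max(steps, key=lambda step: step[1], default=None)

# Three lines copied from the log above
excerpt = [
    "kernel: Index 23: took 2004 ms to do msleep",
    "kernel: Index 3: took 0 ms to do submit_bio for read",
    "kernel: Index 4: took 9995 ms to do waiting for read completion",
]
print(slowest_step(excerpt))
```

For the log above this points at index 4: almost ten seconds spent waiting for a read to complete, which fits the array stalling I/O while the new volume was being created.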

OCFS2 fun

March 6, 2008 (updated August 16, 2014), by Christian

Turns out that said colleague has been playing with NFS on one of the web nodes, thus apparently rendering the remaining nodes offline (or semi-offline).

After all the web nodes hung themselves, we had to hard reset them; now everything is tingly again .. *yay* for a great first day …

OCFS2 fun yet again

March 6, 2008 (updated June 21, 2013), by Christian

I’m back at work today after a six-day vacation in the warm south (that is, Stuttgart) and find three sheets of paper on my desk. Two tell me about something I haven’t done yet, the other one about something I haven’t seen yet.

One of my colleagues had to restart one of our web nodes, and now the thing can’t mount the logging volume (and thus logrotate / awstats failed to do its job). OCFS2 ain’t spitting out any error messages; when trying to mount the volume, you see it joining the domain the volume belongs to on the other nodes, so at first glance .. nothing is wrong?

One thing I’ll have to add is that you can’t reboot the box cleanly (as in, you have to use the power button), so I figure something is either stuck or malfunctioning .. *shrug*

Zend Optimizer again

February 19, 2008 (updated June 21, 2013), by Christian

Well, I happen to be back at my favorite application. Today I stumbled upon a “nice” thing. If you turn on the Zend Optimizer (doesn’t matter whether it is 2.6.2 or 3.3.0), one of the TYPO3 back ends ain’t showing *any* content in the preview pane. Once you turn the Zend Optimizer stuff off, it works without a problem.

And as Zend stated on their “Support Forum”, they don’t really support the Zend Optimizer stuff in the first place. Which is nice; what do you need the Zend Guard shit for, then??

Well, so I do have two options now:

Disable the one plug-in that really needs the Zend Optimizer (as it also features the Zend De Guard engine – or whatever you want to call it),
or risk other things breaking because the Zend Optimizer engine doesn’t work (correctly) with php-5.1.2 (which is rather old, considering 5.3.0 is in development right now).

But I will see about that tomorrow …

Been a while

February 16, 2008 (updated August 16, 2014), by Christian

Well, it’s been quite a while since most of the people last heard a word from me. The last few months I’ve been extremely busy with work-related tasks (and as a side-effect of that, didn’t want to spend much time in front of the computer after 9 hours of work). I also started spending more and more time in the gym, like nearly two hours every Tuesday and Thursday.

I finally fixed our replication issues; we now have a working (!) MySQL Multi-Master replication setup (1. Node, 2. Node) as the database back end for our TYPO3 vHosts. Bear in mind that these boxes are *only* serving MySQL and nothing else, so don’t use these configurations on mixed setups.
All the web nodes are now serving the content from a clustered, shared SAN volume (is that a good thing? 😛 Don’t know yet …).
Our VI environment is getting more and more acceptance (even if you hear some complaints now and then, like “awww, damn that crap my 4GiB RAM, 2×3.0GHz Windows 2008 is running soooo choppy”; simple answer: don’t use Windows Server 2008 and/or Windows Vista!).
I finished prepping our VM templates (at least the Windows ones).
We’re still putting together the plans on whether or not to invest in a VDI solution.

The next few weeks are gonna be as frantic as the weeks before: I still have to migrate a lot of TYPO3 installations to our new cluster (which sadly takes time, as we need to wait for DNS changes to propagate). Honestly, I might end up extending the SAN volume for the MySQL data storage, as even with only three somewhat busy sites, the binary log of the last 5 days is about 2GiB in size. And we still have ~20 other busy sites on a separate box.

Lucky me, I created the MySQL data storage on a logical volume, so I can easily extend the volume in the san-manager semi-online (the fs needs to be unmounted, and thus the MySQL process stopped), then extend the physical volume (LVM2 PV) and the logical volume (LV) afterwards, and finally the ext3 file system on top.
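Roughly, the sequence would look like the commands printed by the sketch below. It only prints the plan and runs nothing. The device, volume group, mount point and init script names are placeholders I made up, it assumes the PV sits directly on the SAN LUN (no partition table), and the SAN-side resize in the san-manager (plus whatever rescan the host needs to see the new size) is not shown.

```python
def grow_plan(pv="/dev/sdX", lv="/dev/vg_mysql/lv_data",
              mountpoint="/var/lib/mysql", service="/etc/init.d/mysql"):
    """Return the shell commands for growing an ext3 file system on LVM2.

    All default names are hypothetical placeholders. The SAN volume itself
    must already have been extended before pvresize is run.
    """
    return [
        f"{service} stop",              # MySQL has to release the file system
        f"umount {mountpoint}",
        f"pvresize {pv}",               # let LVM see the larger physical volume
        f"lvextend -l +100%FREE {lv}",  # grow the LV into the new free extents
        f"e2fsck -f {lv}",              # resize2fs wants a freshly checked fs
        f"resize2fs {lv}",              # grow ext3 to fill the LV
        f"mount {lv} {mountpoint}",
        f"{service} start",
    ]

for command in grow_plan():
    print(command)
```

Printing the plan first and pasting the commands one at a time seemed safer to me than scripting the real thing on a database box.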

As some of you know by now, I am on ~~extended leave for now~~. I don’t have tree access (at my own request), though I’m gonna try to keep up with Chris and 2008.0 … So long!

TYPO3 and MySQL replication

September 8, 2007 (updated June 21, 2013), by Christian

Apparently the TYPO3 version we are using doesn’t play too nicely with MySQL master-master replication.

Sometimes, something like this is going to happen:

070826  0:44:32 [ERROR] Slave: Error 'Duplicate entry '75-222419149' for key 1' on query. Default database: 't3nb'. Query: 'INSERT INTO cache_pagesection
070826  0:44:32 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'dbc-mysql1.000192' position 611861372

Well, as you can see from the last line in the log, the slave SQL thread found a duplicate entry and thought it would be smart to just shut itself down instead of discarding the entry it had just been handed. So from then on, both databases drift apart, since there ain’t no replication anymore until someone kick-starts it again (someone being me).

Anyway, I think I finally traced the fucker down: supposedly one of the problematic cases is located in t3lib/class.t3lib_tstemplate.php on line 362.

$GLOBALS['TYPO3_DB']->exec_DELETEquery('cache_pagesection', 'page_id='.intval($GLOBALS['TSFE']->id).' AND mpvar_hash='.t3lib_div::md5int($GLOBALS['TSFE']->MP));
$GLOBALS['TYPO3_DB']->exec_INSERTquery('cache_pagesection', $insertFields);

Basically what TYPO3 is doing is a DELETE and an INSERT right afterwards. But apparently, it doesn’t check whether the DELETE even succeeded. I hacked it for now, simply adding this:

-                               $GLOBALS['TYPO3_DB']->exec_INSERTquery('cache_pagesection', $insertFields);
+                               // Only insert a new cache entry with the same value, if the DELETE succeeded
+                               if ($GLOBALS['TYPO3_DB']->sql_affected_rows() == 1)
+                                       $GLOBALS['TYPO3_DB']->exec_INSERTquery('cache_pagesection', $insertFields);
+

Sadly, this looks more and more like a race condition between the two boxes (as in the replication / UPDATE being too slow) when users visit an edited page that hasn’t had its cache regenerated yet. Problem is, it ain’t just this single spot, but also the search indexing, the image cache and the whole page cache. For now we switched the cluster to active/passive load balancing, till we have a chance to see if a newer TYPO3 fixes those issues.
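The interleaving I suspect can be reproduced in miniature. This is only a toy model: an in-memory SQLite database stands in for one MySQL master, the table is cut down to the two key columns plus a dummy content column, and INSERT OR REPLACE is SQLite syntax for an idempotent write, shown purely for contrast (it is not what TYPO3 or my hack does).

```python
import sqlite3

def new_node():
    """One in-memory database standing in for one MySQL master."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE cache_pagesection ("
               "page_id INTEGER, mpvar_hash INTEGER, content TEXT, "
               "PRIMARY KEY (page_id, mpvar_hash))")
    return db

KEY = (75, 222419149)  # the key from the error log further up
DELETE = "DELETE FROM cache_pagesection WHERE page_id=? AND mpvar_hash=?"
INSERT = "INSERT INTO cache_pagesection VALUES (?, ?, ?)"
UPSERT = "INSERT OR REPLACE INTO cache_pagesection VALUES (?, ?, ?)"

def interleaved_write(db, insert_sql):
    """Local DELETE+INSERT on this node, interleaved with the same pair
    of statements arriving from the other master via replication."""
    db.execute(DELETE, KEY)                        # local DELETE
    db.execute(DELETE, KEY)                        # replicated DELETE
    db.execute(insert_sql, KEY + ("local",))       # local INSERT
    try:
        db.execute(insert_sql, KEY + ("remote",))  # replicated INSERT
        return "ok"
    except sqlite3.IntegrityError:
        return "duplicate entry"

print(interleaved_write(new_node(), INSERT))
print(interleaved_write(new_node(), UPSERT))
```

With the plain INSERT, the replicated statement hits the row the local INSERT just wrote, which is the same kind of duplicate-key error the slave thread died on; with an idempotent write the second statement simply overwrites the row.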

Bitching

March 10, 2006 (updated June 21, 2013), by Christian

Once again, I’m compelled to play (others call it administering :P) with our TYPO3 cluster (which is sadly running SLES).

One thing I just learned about SLES (for the curious ones: it’s Novell’s SuSE Linux Enterprise Server, and yes, it suffers the same pain as SuSE/openSuSE) is that they split one single config file (at least the apache2 one) into 9 (or more) different files.

Another thing: why the hell does a simple LAMP box need a full-blown Xorg w/ KDE installed?

Good lord! Praise the USE-flags (e.g. -X or -kde)