Three important changes were made on Augend and Borel today, aimed at improving Web server stability. We have been seeing a worker crash every few minutes on each server. These crashes are, as far as I am aware, entirely random, and may be triggered during the 3-minute service check on http://<server>/phpinfo.php. When the service check fails due to a crash, a restart is issued. The crash, however, appears to be far-reaching, stalling the proper shutdown of all Apache workers. Instead of a normal restart, the service monitor fails to restart the service properly and only completes the restart during the next check 3 minutes later.

From prior knowledge, Zend Optimizer has caused these dangling restarts in the past, so this is the first thing to be upgraded on the two more problematic servers. Augend has also had ionCube temporarily removed as no one uses ionCube’s loader to load ionCube-encoded code on it. Hopefully one of these two is the culprit, with a simple fix.

Further, the amount of time the init script waits for Apache to shut down has been raised from 10 seconds to 30 seconds. If Apache does not shut down properly within 30 seconds, the shutdown/startup process will bail. Apache is configured with a graceful shutdown timeout of 15 seconds, so the init script now waits longer than the graceful shutdown window. These three fixes should permanently resolve the erratic segfaults that we have seen predominantly on Augend and Borel.
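The wait-then-bail behaviour can be sketched as follows. This is only an illustration in Python, not the actual init script: `wait_for_shutdown` and its parameters are hypothetical names, and the `is_running` check stands in for however the real script detects a live Apache process. Only the 30-second limit comes from the change described above.

```python
import time
from typing import Callable


def wait_for_shutdown(
    is_running: Callable[[], bool],
    timeout: float = 30.0,        # init script wait, raised from 10 seconds
    poll_interval: float = 1.0,   # assumed polling rate, not from the post
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the service stops; return False if the timeout expires."""
    deadline = clock() + timeout
    while is_running():
        if clock() >= deadline:
            # Service is still up after the allowed wait: give up (bail).
            return False
        sleep(poll_interval)
    # Service stopped within the allowed wait, so a start can follow.
    return True
```

A caller would only start Apache again when this returns `True`. With a 15-second graceful shutdown timeout, a 30-second wait leaves room for the graceful stop to finish before the script gives up.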

– Matt

Update: just to clarify, the crashes affect a select few users. With one crash every 3 minutes and 30 pages served per second on average, any given request has a 1/5400 chance of being the one to crash. PHP pages tend to encounter this bug more often, which is why I am looking into the trifecta of loaders/optimizers first.
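For anyone wanting to check the 1/5400 figure, here is the arithmetic as a short Python sketch. The variable names are mine; the inputs are the averages quoted above, and the result assumes crashes are spread evenly across all requests, which understates the odds for PHP pages.

```python
# Figures quoted in the update: one crash per 3-minute check window,
# about 30 pages served per second on average.
CHECK_INTERVAL_SECONDS = 3 * 60
PAGES_PER_SECOND = 30
CRASHES_PER_INTERVAL = 1

# Total requests served in one 3-minute window.
requests_per_interval = CHECK_INTERVAL_SECONDS * PAGES_PER_SECOND

# Chance that any single request is the one that crashes,
# assuming crashes are spread evenly across requests.
crash_probability = CRASHES_PER_INTERVAL / requests_per_interval

print(f"requests per interval: {requests_per_interval}")
print(f"chance per request: 1 in {requests_per_interval} ({crash_probability:.4%})")
```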

Apache changes, temporary removal of ionCube, Zend Optimizer upgrade