Service monitor resiliency changes (MySQL connection counting, restart timing)

Two changes were pushed to the servers today, both aimed at improving service resiliency in the event of a problem.

First, MySQL connection numbers are now counted and compared against normal connection rates, providing an extra layer of protection against the dreaded table-level lock that trickles out and locks all databases across all clients. Prior to today, the check was a simple, cacheless SELECT query, which worked well for determining whether there was a problem within MySQL before the 5.0.67 upgrade. In addition to counting active connections, the query has been rewritten so that it requires a temporary table to be created (it includes JOIN and sort clauses), reproducing the average scenario for most MySQL queries.
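
As a rough illustration of the connection-count comparison, here is a minimal Python sketch. It is not the monitor's actual code: the function name, the `factor` multiplier, and the `floor` value are all assumptions made for the example, and the current count is assumed to come from something like MySQL's `Threads_connected` status variable.

```python
from statistics import mean


def connection_count_is_abnormal(current, recent_counts, factor=3.0, floor=20):
    """Return True when `current` is well above the recent average.

    `factor` and `floor` are hypothetical tuning values, not the
    thresholds used by the real monitor.
    """
    if not recent_counts:
        # No baseline yet, so there is nothing to compare against.
        return False
    baseline = mean(recent_counts)
    # The floor keeps a very quiet server from tripping on small bumps.
    threshold = max(baseline * factor, floor)
    return current > threshold


# A pile-up behind a table lock shows as a count far above baseline.
print(connection_count_is_abnormal(200, [30, 35, 40]))
print(connection_count_is_abnormal(50, [30, 35, 40]))
```

Comparing against a multiple of the recent average, rather than a fixed limit, lets the same check work on both busy and quiet servers.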

Second, timestamps are now factored into service restarts. Two restarts within a 60-minute interval will dispatch a page to all of the technicians, myself included. In all cases, more than one restart within an hour indicates an underlying problem on the server that requires administrative action to correct.
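
The restart rule can be sketched as a sliding window over restart timestamps. This is only an illustration under stated assumptions: `RestartTracker` and its parameters are hypothetical names, timestamps are plain seconds, and two restarts exactly 60 minutes apart are treated as falling outside the window.

```python
from collections import deque


class RestartTracker:
    """Track restart times and flag repeats inside a time window."""

    def __init__(self, window_seconds=3600, page_threshold=2):
        self.window_seconds = window_seconds
        self.page_threshold = page_threshold
        self._restarts = deque()

    def record_restart(self, timestamp):
        """Record a restart; return True if technicians should be paged."""
        self._restarts.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop restarts that are a full window old or older.
        while self._restarts and self._restarts[0] <= cutoff:
            self._restarts.popleft()
        return len(self._restarts) >= self.page_threshold


tracker = RestartTracker()
print(tracker.record_restart(0))      # first restart in the window
print(tracker.record_restart(1800))   # second restart 30 minutes later
```

Keeping only the timestamps inside the window means the check stays cheap no matter how long the monitor has been running.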

– Matt
