Unexpected Outage

Our website suffered unplanned downtime from 1am to 10am this morning (PDT). The web growth we’ve seen lately came back to haunt us: our servers drew too much power and tripped a circuit breaker. We’ve since reconfigured our power circuits to address the problem.

Hardware failure aside, the real embarrassment for Ohloh is how long it took us to respond – it was unacceptable. Our site is monitored, but none of us caught the alerts at 1am. The SMS and email alerts didn’t wake us up (as they have in the past). We clearly need a better system.

My first step is to find a better web monitoring service – hopefully one with some type of escalation procedure. Ideally I’d like a service that starts by sending email/SMS but then escalates to calling alternative phone numbers until someone responds. I’d welcome any suggestions below (or at jason@ohloh.net if you prefer).
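For anyone wondering what that kind of escalation looks like in practice, here is a rough Python sketch. It is only an illustration of the idea, not any particular service's behavior: the `Step` type, the `escalate` function, the callbacks, and the addresses and phone numbers are all made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One rung of a hypothetical escalation ladder."""
    channel: str  # "email", "sms", or "call"
    target: str   # placeholder address or phone number


def escalate(
    steps: Sequence[Step],
    send: Callable[[Step], None],
    acknowledged: Callable[[], bool],
    wait: Callable[[], None],
) -> Optional[Step]:
    """Notify each step in order and stop as soon as someone responds.

    Returns the step that got a response, or None if nobody answered.
    """
    for step in steps:
        send(step)          # deliver the alert over this channel
        wait()              # give the person time to respond
        if acknowledged():  # someone picked up: stop escalating
            return step
    return None


# Simulated run: nobody reacts to the email or SMS, the first call is answered.
policy = [
    Step("email", "oncall@example.com"),
    Step("sms", "+1-555-0100"),
    Step("call", "+1-555-0101"),
    Step("call", "+1-555-0102"),
]
sent = []
answered = escalate(
    policy,
    send=sent.append,
    acknowledged=lambda: len(sent) >= 3,
    wait=lambda: None,  # a real system would pause for minutes here
)
print(answered)
```

In a real setup, `send` would hand off to whatever delivers the email, SMS, or phone call, and `wait` would pause for a few minutes between rungs. The important part is the loop: it keeps moving to the next contact until someone acknowledges the alert.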

Our sincerest apologies. The beer’s on us next time we meet.

