Late Thursday, Twitter revealed the cause of its two-hour, mid-morning global service outage: two parallel data center failures. As Twitter VP of engineering Mazen Rawashdeh explained in a blog post:
Data centers are designed to be redundant: when one system fails (as everything does at one time or another), a parallel system takes over. What was noteworthy about today’s outage was the coincidental failure of two parallel systems at nearly the same time.
I wish I could say that today’s outage could be explained by the Olympics or even a cascading bug. Instead, it was due to this infrastructural double-whammy. We are investing aggressively in our systems to avoid this situation in the future.
Rawashdeh also added that the Twitter engineering team was “making the service even better and more stable than ever.”