Disasters Come in All Shapes and Sizes

Analysis of Amazon Outage

In a year that has already seen earthquakes, tsunamis and tornadoes of epic proportions, those of us who live in the world of cloud computing learned that disasters of equivalent proportions can happen from within the data center and not just to the data center. Today Amazon Web Services (AWS) released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the U.S. East Region, providing insight into the causes of the April 21 outage.

The effort of the AWS team that worked to resolve the event, diagnose the situation and begin applying fixes was remarkable. AWS produced a detailed post-mortem that clearly identifies the root cause of the event and factors that contributed to the rapid growth, and prolonged nature, of the outage.

So what happened? There were three main factors that contributed to the disruption:

  • Human error
  • Architectural design
  • Software bug

On April 21, a network change was being made in the U.S. East region. Part of the standard change procedure required shunting the network traffic from one router to another on the primary network. Instead, the traffic was mistakenly diverted to a secondary network that could not handle the load. As a result, a group of nodes responsible for storing data lost connectivity and became isolated. Under normal operation these nodes create replicas, or mirror their data, in multiple places to ensure that the data survives if any one node fails. As designed, this process takes place very quickly; in an isolated state, however, the nodes frantically searched for space to protect their data but could find none.
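AWS's post-mortem does not publish the node code, but the "frantic search" it describes is essentially a tight retry loop, and the standard remedy for the resulting retry storm is exponential backoff with jitter, so that many isolated nodes do not hammer their peers in lockstep. A minimal sketch of that pattern (all names and parameters here are hypothetical, not AWS's actual implementation):

```python
import random
import time

def find_replica_space(try_peer, peers, max_retries=8, base=0.05, cap=2.0):
    """Search peers for space to re-mirror data. Between failed rounds,
    sleep for an exponentially growing, randomly jittered delay so that
    many nodes retrying at once do not create a coordinated storm."""
    for attempt in range(max_retries):
        for peer in peers:
            if try_peer(peer):      # hypothetical probe: does this peer have space?
                return peer
        # exponential backoff with full jitter, capped at `cap` seconds
        delay = random.uniform(0, min(cap, base * 2 ** attempt))
        time.sleep(delay)
    return None  # give up and surface the failure rather than retry forever
```

The jitter is the important part: without it, isolated nodes that failed together retry together, which is exactly the synchronized load that overwhelmed the cluster.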

The first attempt to resolve the problem was to roll back the network change, but once reconnected, the frantically searching nodes created a “re-mirroring storm” that quickly exhausted storage capacity and caused further degradation. This in turn began to significantly impact other parts of the infrastructure. Finally, the software on the storage nodes not only failed to back the nodes off from aggressively searching for space to re-mirror, but also contained a flaw, or bug, known as a “race condition.” While a race condition is a flaw, under normal operations in this architecture it would have had little to no impact. In the midst of the re-mirroring storm, however, this bug contributed to the further expansion of the event.
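For readers unfamiliar with the term: a race condition occurs when two concurrent actors read and update shared state without coordination, so the outcome depends on timing. The post-mortem does not describe the specific bug, but a minimal, hypothetical illustration in Python shows the general shape, a non-atomic read-modify-write that is harmless under light load and corrupting under contention:

```python
import threading
import time

class Counter:
    """Shared state whose unsafe increment is a non-atomic read-modify-write."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def unsafe_increment(self):
        v = self.value        # read
        time.sleep(0)         # yield to other threads, widening the race window
        self.value = v + 1    # write; may silently overwrite a concurrent update

    def safe_increment(self):
        with self.lock:       # lock serializes the read-modify-write
            self.value += 1

def run(use_lock, threads=8, iters=2000):
    c = Counter()
    step = c.safe_increment if use_lock else c.unsafe_increment
    ts = [threading.Thread(target=lambda: [step() for _ in range(iters)])
          for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return c.value
```

With the lock, `run(True)` always returns 16000; without it, `run(False)` typically comes up short because concurrent updates clobber each other. The parallel to the outage is that the flaw is invisible when requests are rare and only surfaces under the extreme concurrency of the storm.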

Certainly interesting technical details about Amazon Web Services cloud infrastructure emerged, but for those familiar with root cause analysis there were no real surprises — a chain of events triggered by human error. I echo a colleague’s observation that Amazon’s commitment to behavioral and organizational change to better serve customers is impressive. Disasters, both natural and those created by our own hand, will happen. What’s important is what we learn and how we change because of them.

Heath-

