Amazon’s Details Expose Cloud’s Ugly Side
Techs weren’t able to start adding new storage until 2:00 AM on April 22—more than a day after the start of the outage. By 12:30 PM, all but 2.2 percent of the volumes were restored, although not all of them were completely unstuck. It took until 11:30 AM on April 23 to work out how to reconnect the stuck volumes to the control plane without overloading it again and to test the process. By 6:15 PM, most nodes were communicating again.
Then came the process of manually recovering the remaining 2.2 percent of the volumes that were still stuck. By 12:30 PM on April 24, three and a half days after the original outage, all but 1.04 percent of the affected volumes were recovered. In the end, 0.07 percent of the volumes could never be restored. (Amazon sent snapshots of that data to the customers it belonged to, advising them, “If you have no need for this snapshot, please delete it to avoid incurring storage charges.”)
And Amazon’s cloud database service? That was affected, too, and the results were even more catastrophic. The cloud database service runs on top of the cloud storage system, and a single database typically spans several storage volumes, so one stuck volume is enough to hobble the whole database. The result: for customers whose databases sat entirely in the crippled availability zone, even though at worst only 13 percent of the storage volumes were stuck, at the peak of the problem 45 percent of those databases were crippled by stuck volumes.
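That amplification is roughly what simple probability predicts, if we assume, purely for illustration, that each database stripes across a handful of storage volumes and that volumes get stuck independently of one another (neither assumption is spelled out in Amazon’s report). The sketch below shows how a 13 percent per-volume failure rate can turn into something close to a 45 percent per-database rate:

```python
# Illustrative only: assumes each database stripes across `n` storage volumes
# and that volumes get stuck independently -- assumptions not confirmed by
# Amazon's outage report.

def p_database_stuck(p_volume_stuck: float, volumes_per_db: int) -> float:
    """Probability that at least one of a database's volumes is stuck."""
    return 1.0 - (1.0 - p_volume_stuck) ** volumes_per_db

p = 0.13  # worst-case share of stuck volumes in the availability zone
for n in range(1, 6):
    print(f"{n} volume(s) per database -> "
          f"{p_database_stuck(p, n):.0%} of databases affected")

# With 4 volumes per database, roughly 43% of databases would be hit --
# in the same ballpark as the 45% Amazon reported.
```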
The final tally for the outage: Exactly half a week during which a significant number of Amazon cloud customers suffered from crippled or nonexistent IT functionality.
In a conventional datacenter, with a conventional approach to maintenance windows, that would have been almost impossible (although American Eagle Outfitters might beg to differ). The initial network configuration error would probably have been caught as soon as testing of the changes began. The cascade of stuck storage nodes, the control plane thread starvation, the exhausted storage space and the crippled databases—they never would have happened.
But all that technology dedicated to supporting availability, one of Amazon’s highest priorities, ultimately produced roughly 1 percent of a year as downtime in a single stretch.
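To put that figure in perspective, here is a back-of-the-envelope check (a minimal sketch; the 3.5-day duration comes from the timeline above, and the function names are just for illustration):

```python
# Back-of-the-envelope check on "1 percent of a year as downtime."
# Assumes the outage ran roughly 3.5 days (April 21 through midday April 24),
# as described in the timeline above.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_as_fraction_of_year(downtime_hours: float) -> float:
    """Return downtime as a fraction of a calendar year."""
    return downtime_hours / HOURS_PER_YEAR

def availability_percent(downtime_hours: float) -> float:
    """Availability over a year, given a single stretch of downtime."""
    return 100.0 * (1.0 - downtime_as_fraction_of_year(downtime_hours))

outage_hours = 3.5 * 24  # half a week = 84 hours

print(f"Downtime: {outage_hours:.0f} hours")
print(f"Fraction of a year: {downtime_as_fraction_of_year(outage_hours):.2%}")      # ~0.96%
print(f"Best-case availability for the year: {availability_percent(outage_hours):.2f}%")  # ~99.04%
```

Put another way, a single incident of this length burns through nearly ten times the annual downtime budget of a “three nines” (99.9 percent) service, which allows only about 8.8 hours of downtime per year.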
Amazon has outlined the changes it plans to make, and those changes should make the next incident less painful. But when an IT shop has to work without a net, there will be a next incident. There’s no way to avoid it.
May 5th, 2011 at 1:21 pm
I have to disagree with the simplistic takeaway that 24/7 systems are inherently riskier than 23/7 systems. “23/7” is just a name I gave for anything less than 24/7 that requires a maintenance window. My experience is that the risk is virtually the same, just different.
I have a lot of experience with our 24/7 systems, and with interfaces to many third-party applications and hosts of both flavors. I find that when issues arise with 24/7 systems, they tend to be smaller in nature and get diagnosed and resolved more quickly. 23/7 systems, on the other hand, have issues less often, but when they do occur, they have a much higher chance of being catastrophic and take much longer to resolve. There are always exceptions on both sides, but this has been my experience.