Amazon’s Details Expose Cloud’s Ugly Side
With the sudden loss of a usable network, many storage nodes lost contact with their replicas. Amazon’s system is set up so that when that happens, the storage node assumes the replica has gone bad and immediately begins searching for a place to create a new replica. Normally, that would happen in milliseconds. But it wasn’t until techs identified and corrected the network mistake that those storage nodes could try to mirror themselves.
When the network was restored, it was a catastrophe. A large number of nodes simultaneously went looking for places to replicate. The available free storage was quickly exhausted, leaving many nodes stuck in a loop, searching for free space—what Amazon called a “re-mirroring storm” that prevented 13 percent of the storage volumes in the affected availability zone from doing anything other than looking for space that wasn’t there.
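The dynamic is easy to see in miniature. The sketch below is not Amazon's code; the Zone and StorageNode classes and the numbers are invented for illustration. It only models a shared pool of free space and shows how nodes that all try to re-mirror at once can exhaust that pool and end up stuck.

```python
# Illustrative sketch of the re-mirroring behavior described above.
# All names and numbers are hypothetical, not Amazon's actual EBS code.

class Zone:
    def __init__(self, free_gb):
        self.free_gb = free_gb  # shared pool of spare capacity in the zone

    def find_free_space(self, needed_gb):
        """Reserve space if any is left; return False if the pool is dry."""
        if self.free_gb >= needed_gb:
            self.free_gb -= needed_gb
            return True
        return False


class StorageNode:
    def __init__(self, zone, volume_gb):
        self.zone = zone
        self.volume_gb = volume_gb
        self.stuck = False

    def on_replica_lost(self):
        """Node assumes its replica is gone and tries to re-mirror at once."""
        if not self.zone.find_free_space(self.volume_gb):
            # No room anywhere: the node keeps searching instead of serving I/O.
            self.stuck = True


# Normally a handful of nodes re-mirror and finish in milliseconds. After the
# network came back, a large fraction asked at once and the pool ran dry:
zone = Zone(free_gb=10_000)
nodes = [StorageNode(zone, volume_gb=100) for _ in range(1_000)]
for n in nodes:
    n.on_replica_lost()

print(f"stuck nodes: {sum(n.stuck for n in nodes)} of {len(nodes)}")
# Only the first 100 requests fit; the other 900 keep looking for space
# that isn't there, which is the "re-mirroring storm" in miniature.
```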
All those requests for more space were hammering on a software control plane that did the work of creating new storage volumes. Because the control plane was configured with a long time-out period, requests for space began to back up. That used up all the processor threads for the control plane, effectively locking it up. Result: The problems spread from a single availability zone to other cloud availability zones in the Virginia datacenter.
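Why a long time-out turns into a lock-up is worth spelling out. The sketch below uses illustrative numbers, not Amazon's configuration: a small fixed pool of worker threads is held for the full time-out by requests that can never succeed, so even a healthy request from another zone has to wait.

```python
# Sketch of how a long time-out can exhaust a shared control plane's threads.
# WORKER_THREADS and CREATE_TIMEOUT_S are made-up values for illustration.

import time
from concurrent.futures import ThreadPoolExecutor

WORKER_THREADS = 4          # the control plane's small, fixed worker pool
CREATE_TIMEOUT_S = 5.0      # a "long" time-out while waiting for free space


def create_volume(zone_has_space: bool) -> str:
    if zone_has_space:
        return "created"            # normal case: returns almost immediately
    time.sleep(CREATE_TIMEOUT_S)    # degraded zone: thread is held the whole time
    return "timed out"


pool = ThreadPoolExecutor(max_workers=WORKER_THREADS)

# Requests from the degraded zone arrive first and occupy every worker...
doomed = [pool.submit(create_volume, False) for _ in range(WORKER_THREADS)]

# ...so a perfectly healthy request from another availability zone just queues.
start = time.time()
healthy = pool.submit(create_volume, True)
print(healthy.result(), f"after {time.time() - start:.1f}s")  # ~5s, not ~0s
```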
At 2:40 AM Los Angeles time—two hours after the original network mistake—techs disabled the ability of nodes in the original availability zone to request new space. By 2:50 AM, the control plane began to stabilize.
But by 5:30 AM, as the number of stuck storage nodes increased, the control plane began to fail again—and this time, it was knocked out entirely. At 8:20 AM, techs began disabling all communication between storage nodes in the original availability zone and the control plane. Once again, everything outside that zone began returning to normal.
By 11:30 AM, techs figured out a way to block the servers in the problem zone from asking each other for storage space that none of the other servers had, either. By 12:00 PM, error rates had returned to near normal—but the number of stuck volumes was back up to 13 percent.
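That fix amounts to a kill switch in front of the space-request path. A minimal sketch, assuming it works like a simple gate checked before each request; the zone names and function names here are hypothetical, not AWS identifiers or APIs.

```python
# Hypothetical gate that stops nodes in a degraded zone from asking peers
# for replica space they don't have.

DEGRADED_ZONES = {"zone-a"}   # zones where peer space requests are blocked


def remirror_allowed(zone: str) -> bool:
    """Deny re-mirror searches inside a degraded zone."""
    return zone not in DEGRADED_ZONES


def request_space(zone: str, needed_gb: int) -> bool:
    if not remirror_allowed(zone):
        # Fail fast instead of hammering peers that are also out of space.
        return False
    # ...normal path: ask peer nodes or the control plane for capacity...
    return True


print(request_space("zone-a", 100))   # False: blocked, the node stops retrying
print(request_space("zone-b", 100))   # True: healthy zones are unaffected
```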
And the only way to get them unstuck was to physically bring in lots more storage. There was no way to kill off the many stuck data replicas until working replicas were created, nor was there space to create the working replicas without new hardware. Amazon couldn’t even use its own cloud services for that storage—its “regions” are kept isolated from each other to keep problems from spreading.
Techs weren’t able to start adding new storage until 2:00 AM on April 22—more than a day after the start of the outage.
May 5th, 2011 at 1:21 pm
I have to disagree with the simplistic takeaway that 24/7 systems are inherently riskier than 23/7 systems. “23/7” is just a name I gave for anything less than 24/7 that requires a maintenance window. My experience is that the risk is virtually the same, just different.
I have a lot of experience with our 24/7 systems, and with interfaces to many third-party applications and hosts of both flavors. I find that when issues arise with 24/7 systems, they tend to be smaller in nature and get diagnosed and resolved more quickly. With 23/7 systems, on the other hand, issues may arise less often, but when they do occur they have a much higher chance of being catastrophic and take much longer to resolve. There are always exceptions on both sides, but this has been my experience.