Amazon’s Details Expose Cloud’s Ugly Side
Written by Frank Hayes
In a detailed postmortem of its days-long cloud-storage outage, Amazon on April 29 delivered a blow-by-blow account of what went wrong: a single networking mistake left a cloud full of “stuck” storage, the automatic recovery process then filled all available space with junk data, and Amazon finally had to bring in loads of new storage hardware to unjam the system.
The cascading problems were the result of Amazon’s efforts to promise continuous availability of its cloud storage. That meant no downtime for maintenance windows—Amazon’s network techs had to work without a net, and this time they were unlucky. But a dive into the details of the outage suggests that a cloud like Amazon’s may not be worth the risk, or even offer an advantage, for big retailers—even though Amazon itself is one of the biggest.
The question comes down to whether retailers need the constant availability that Amazon’s cloud storage offers. There’s a time when brick-and-mortar stores are closed and E-Commerce traffic slows to a trickle. Every retail IT shop knows when that is, and that’s the time for a maintenance window—shutting down all or part of a datacenter to make significant changes. That window provides a safety buffer, when changes can be tested and a single error can’t turn into a runaway problem.
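As a rough illustration of how that buffer works in practice, here is a minimal sketch of gating a risky change on an overnight maintenance window. The hours and time zone are assumptions for the example, not anything Amazon or any particular retailer has published.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

# Hypothetical overnight window, 1:00-5:00 AM local time, when stores are
# closed and e-commerce traffic is at its lowest. The hours and time zone
# are illustrative assumptions, not taken from Amazon's postmortem.
WINDOW_START = time(1, 0)
WINDOW_END = time(5, 0)
STORE_TZ = ZoneInfo("America/New_York")


def in_maintenance_window(now: datetime | None = None) -> bool:
    """Return True if the local wall-clock time falls inside the window."""
    now = now or datetime.now(STORE_TZ)
    return WINDOW_START <= now.time() <= WINDOW_END


if __name__ == "__main__":
    if in_maintenance_window():
        print("Inside the window: risky changes may proceed.")
    else:
        print("Outside the window: defer the change until tonight.")
```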
Amazon’s cloud operation doesn’t have that option; it has to promise customers that the cloud will be available all the time. That obligation to deliver 24/7 uptime that retailers themselves don’t need could be the greatest source of risk for any retailer considering the cloud.
According to Amazon, its cloud outage began with exactly the sort of change that a conventional datacenter would use a maintenance window for. At 12:47 AM Los Angeles time on April 21, as part of a procedure to upgrade the network for one of Amazon’s “availability zones,” network techs shifted traffic off one high-capacity router to clear the path for upgrading it.
Traffic was supposed to be shifted to another high-capacity router. Instead, it was mistakenly redirected to a low-capacity network that couldn’t handle the load of storage nodes that are constantly making copies of themselves—a fundamental process in the way Amazon’s cloud storage works.
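The postmortem doesn’t describe Amazon’s tooling, but the nature of the mistake suggests the kind of guardrail a change procedure could enforce. The sketch below, in which the router names, capacities and load figures are all hypothetical, refuses a traffic shift when the target link can’t absorb the load being moved.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    capacity_gbps: float   # what the link can carry
    load_gbps: float       # what it is carrying right now


def shift_traffic(source: Router, target: Router) -> None:
    """Move all of source's traffic to target, refusing if target lacks headroom."""
    headroom = target.capacity_gbps - target.load_gbps
    if source.load_gbps > headroom:
        raise RuntimeError(
            f"Refusing shift: {target.name} has {headroom:.0f} Gbps headroom, "
            f"needs {source.load_gbps:.0f} Gbps"
        )
    target.load_gbps += source.load_gbps
    source.load_gbps = 0.0


# Hypothetical numbers: the intended target can absorb the traffic;
# the low-capacity network that actually received it cannot.
primary   = Router("primary-router",   capacity_gbps=100, load_gbps=80)
secondary = Router("secondary-router", capacity_gbps=100, load_gbps=10)
low_cap   = Router("low-capacity-net", capacity_gbps=10,  load_gbps=2)

shift_traffic(primary, secondary)      # the intended move: plenty of headroom
try:
    shift_traffic(secondary, low_cap)  # the mistaken move: nowhere near enough
except RuntimeError as err:
    print(err)
```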
With the sudden loss of a usable network, many storage nodes lost contact with their replicas.
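What the postmortem calls a re-mirroring storm is easier to see in miniature. The toy model below is not Amazon’s actual EBS logic, and the node counts and volume sizes are invented; it simply shows how nodes that lose contact with their replicas all try to create fresh copies at once and, between them, exhaust the cluster’s spare capacity, leaving the rest “stuck.”

```python
import random

# Toy model of a replicated storage cluster. All numbers are invented for
# illustration; this is not Amazon's EBS implementation.
NODE_COUNT = 1000
VOLUME_GB = 100          # size of each node's replicated volume
FREE_POOL_GB = 20_000    # spare capacity available for new replicas

random.seed(1)

# Suppose the network mistake cuts a large fraction of nodes off from
# their replicas at the same moment.
disconnected = [n for n in range(NODE_COUNT) if random.random() < 0.4]

free_gb = FREE_POOL_GB
stuck = 0
for node in disconnected:
    # Each disconnected node immediately tries to create a fresh replica.
    if free_gb >= VOLUME_GB:
        free_gb -= VOLUME_GB
    else:
        stuck += 1       # no space left: this node is "stuck"

print(f"{len(disconnected)} nodes lost contact with their replicas")
print(f"{free_gb} GB of spare capacity remains")
print(f"{stuck} nodes are stuck waiting for space")
```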
Reader comment, May 5th, 2011 at 1:21 pm:
I have to disagree with the simplistic takeaway that 24/7 systems are inherently riskier than 23/7 systems. “23/7” is just a name I gave for anything less than 24/7 that requires a maintenance window. My experience is that the risk is virtually the same, just different.
I have a lot of experience with our 24/7 systems and with interfaces to many third-party applications and hosts of both flavors. I find that when issues arise with 24/7 systems, they tend to be smaller and get diagnosed and resolved more quickly. With 23/7 systems, on the other hand, issues may come up less often, but when they do, they have a much higher chance of being catastrophic and take much longer to resolve. There are always exceptions on both sides, but this has been my experience.