Human Error Is Still Amazon Cloud’s Achilles Heel
At 5:02 PM on December 24 the Amazon team disabled the ability of the load balancers to scale up or down or be modified. That stopped the spread of the problem. “At the peak of the event, 6.8 percent of running ELB load balancers were impacted,” Amazon said. That’s a bit coy, though—the other 93.2 percent were technically operating correctly but outside the control of customers.
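In effect, that freeze was a control-plane kill switch: state-changing API calls get rejected while already-running load balancers keep passing traffic. Here's a rough Python sketch of the idea (names like handle_api_call and MUTATING_ACTIONS are invented for illustration; this is not Amazon's actual code):

```python
# Hypothetical sketch of a control-plane "freeze" switch.
# MUTATING_ACTIONS, ApiError and handle_api_call are invented names.

MUTATING_ACTIONS = {
    "CreateLoadBalancer",
    "DeleteLoadBalancer",
    "RegisterInstancesWithLoadBalancer",
    "DeregisterInstancesFromLoadBalancer",
    "ConfigureHealthCheck",
}

control_plane_frozen = False  # flipped to True at 5:02 PM on December 24


class ApiError(Exception):
    pass


def handle_api_call(action, params, backend):
    """Reject state-changing calls while frozen; read-only calls still work.

    The data plane (actual traffic forwarding) is untouched, which is the
    sense in which the other 93.2 percent of load balancers kept running
    even though customers could no longer modify them.
    """
    if control_plane_frozen and action in MUTATING_ACTIONS:
        raise ApiError("Control plane temporarily disabled for recovery")
    return backend.execute(action, params)
```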
The team manually recovered some of the affected load balancers, but the main plan was to rebuild the deleted state data as of 12:24 PM, then merge in all the API calls after that point to create an uncorrupted configuration for each load balancer and get them working correctly again.
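In other words, the plan was a point-in-time restore plus log replay: start from the last known-good copy of the state data, then re-apply every logged API call made after that moment, in order. A minimal Python sketch of the pattern, assuming hypothetical helpers (load_snapshot, api_log_since, apply_call) rather than Amazon's actual tooling:

```python
from datetime import datetime

# Last known-good ELB state, per the article's 12:24 PM December 24 timestamp.
SNAPSHOT_TIME = datetime(2012, 12, 24, 12, 24)


def rebuild_state(load_snapshot, api_log_since, apply_call):
    """Point-in-time restore plus replay, using hypothetical helpers.

    load_snapshot(t)      -> dict of load-balancer configs as of time t
    api_log_since(t)      -> logged API calls made after t, in order
    apply_call(state, c)  -> new state after applying one logged call
    """
    state = load_snapshot(SNAPSHOT_TIME)
    for call in api_log_since(SNAPSHOT_TIME):
        state = apply_call(state, call)  # replay in the original order
    return state
```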
The first try at doing that took several hours. It failed.
At 2:45 AM on December 25, a different approach finally made it possible to restore the ELB state data to a snapshot of what it had been more than 12 hours earlier.
At 5:40 AM on December 25, 15 hours' worth of API calls and state changes were finally merged in and verified, and the team began to slowly re-enable the APIs and recover the load balancers.
At 8:15 AM on December 25 most of the APIs and workflows had been re-enabled.
At 10:30 AM on December 25 almost all load balancers were working correctly.
At 12:05 PM on December 25 Amazon announced that its U.S.-East cloud was operating normally again.
Yes, Amazon has learned from the experience—changing access controls so a programmer can’t do that again, adding checks for the health of state data to its data-recovery process and starting to work up ways for load balancers to heal themselves in the future. And to be fair, this incident wasn’t as lengthy as the April 2011 incident in which a change in network configuration (another human error) paralyzed much of Amazon’s cloud storage for three days.
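One of those fixes, sanity-checking state data before a recovery run trusts it, could look something like the sketch below. The field names and the tolerance are guesses for illustration, not Amazon's actual checks.

```python
# Hypothetical health check on recovered state data; fields and the 1%
# tolerance are illustrative guesses, not Amazon's real criteria.
REQUIRED_FIELDS = {"name", "listeners", "instances", "health_check"}


def state_data_looks_healthy(load_balancers, expected_count):
    """Cheap sanity checks before a recovery process trusts this data.

    load_balancers: mapping of load-balancer name -> config dict
    expected_count: roughly how many records we expect to see
    """
    if len(load_balancers) < expected_count * 0.99:
        return False  # records appear to be missing
    for name, config in load_balancers.items():
        if not REQUIRED_FIELDS.issubset(config):
            return False  # a record lost required configuration
    return True
```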
But with the outage hitting on Christmas Eve, the timing was really lousy. Fortunately for most big chains, they're not using Amazon's cloud—yet.
But these high-profile Amazon problems are actually a good sign, at least if you're not Netflix or one of the smaller E-tailers who were slammed by the load-balancing failure. The catastrophes are shorter, fewer and farther between, and Amazon is getting better at dealing with them.
The problem that remains: Amazon is still learning. And all that learning is still happening in datacenters that never shut down for maintenance.
There’s an irony here: We know from our own experience that everything in a cloud can be moved out of a particular datacenter, so it doesn’t have to run 24/7 forever. When Hurricane Sandy was heading up the East Coast of the U.S., StorefrontBacktalk’s cloud provider (not Amazon) moved our virtual production server from Virginia to Chicago, apparently without a glitch, and presumably did the same with all the other cloud servers in that threatened Virginia datacenter. We have no doubt Amazon can do the same thing (and quite possibly did).
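Our provider never detailed how it pulled that off, but on AWS-style infrastructure the same kind of move boils down to copying the machine image to another region, launching it there and repointing DNS. A rough boto3 sketch under those assumptions (the region names, AMI ID and instance type are placeholders; data volumes and the DNS cutover are omitted):

```python
import boto3

SOURCE_REGION = "us-east-1"           # placeholder for the threatened datacenter
TARGET_REGION = "us-west-2"           # placeholder for the destination
SOURCE_AMI = "ami-0123456789abcdef0"  # placeholder image of the production server


def relocate(source_region, target_region, source_ami, instance_type="m5.large"):
    """Copy a machine image across regions and launch a replacement there.

    Simplified sketch: real migrations also handle attached data volumes,
    security groups and a DNS cutover, none of which appear here.
    """
    target_ec2 = boto3.client("ec2", region_name=target_region)

    # Copy the image into the target region (CopyImage is called from
    # the destination region and points back at the source).
    copy = target_ec2.copy_image(
        Name="relocated-production-server",
        SourceImageId=source_ami,
        SourceRegion=source_region,
    )
    new_ami = copy["ImageId"]

    # Wait until the copied image is usable, then launch an instance from it.
    target_ec2.get_waiter("image_available").wait(ImageIds=[new_ami])
    run = target_ec2.run_instances(
        ImageId=new_ami, InstanceType=instance_type, MinCount=1, MaxCount=1
    )
    return run["Instances"][0]["InstanceId"]
```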
That dodges some maintenance-related problems. But it wouldn’t have helped in this case. Things are getting better. But cloud still may not be mature enough for a big chain’s production E-Commerce system—with or without humans.