Human Error Is Still Amazon Cloud’s Achilles Heel

Written by Frank Hayes
January 2nd, 2013

The Amazon Cloud outage on December 24—the one that knocked Netflix offline for much of Christmas Eve—was due purely to human error. And it was the dumbest sort of human error: an Amazon developer with special privileges mistakenly ran a maintenance process against the production system, wiping out critical state data—and then didn’t realize he had crippled the system until hours after it began causing problems for customers, according to the version of events Amazon released on Monday (Dec. 31).

It then took more than 12 hours (including a false start or two) for Amazon’s team to re-create the data, and several more hours to slowly get the system working again. Total outage time: possibly the longest 23 hours and 41 minutes in Amazon’s history.

According to Amazon’s own summary of the outage—beg pardon, “service event”—the problem originated in the load-balancing systems for Amazon’s cloud and only affected customers in the Eastern region of the U.S. At 12:24 PM Pacific time (3:24 PM Eastern) on December 24, “a portion of the ELB [Elastic Load Balancing System] state data was logically deleted. This data is used and maintained by the ELB control plane to manage the configuration of the ELB load balancers in the region (for example, tracking all the backend hosts to which traffic should be routed by each load balancer),” according to Amazon.

Translation: Amazon’s cloud forgot everything it knew about how to let customers do load balancing.

The data was deleted by “one of a very small number of developers who have access to this production environment,” inadvertently running the maintenance process against the production ELB state data, according to the Amazon report.

How was that possible? It turns out that most access to the production cloud goes through a strict change management process, which should have prevented this mistake. But Amazon is in the process of automating some cloud-maintenance processes, and a small number of developers have permission to run those processes manually. It also turns out that once those developers had been granted access to the processes, they didn’t have to go through an approval step again. In effect, that got rid of the “Do you really want to bring the Amazon Cloud crashing down? OK/Cancel” message.
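The gap described here, a one-time grant that never had to be renewed, can be illustrated with a short Python sketch. Everything in it is hypothetical: the `ChangeManager` class, the `run_maintenance` function and the target names are invented for illustration and are not Amazon's tooling. The idea is only that an approval is used up by a single run, so a second run needs a fresh sign-off.

```python
from dataclasses import dataclass, field


class ApprovalRequired(Exception):
    """Raised when a run has no unused approval on file."""


@dataclass
class ChangeManager:
    """Hypothetical stand-in for a change-management system."""

    # Each entry is one unused approval: (developer, process, target).
    _approved: set = field(default_factory=set)

    def approve(self, developer: str, process: str, target: str) -> None:
        """Record a sign-off for exactly one run against one target."""
        self._approved.add((developer, process, target))

    def consume(self, developer: str, process: str, target: str) -> None:
        """Use up the approval, or refuse if none is on file."""
        key = (developer, process, target)
        if key not in self._approved:
            raise ApprovalRequired(
                f"{developer} has no approval to run {process} on {target}"
            )
        self._approved.remove(key)


def run_maintenance(cm: ChangeManager, developer: str,
                    process: str, target: str) -> str:
    """Run a maintenance process only after consuming a fresh approval."""
    cm.consume(developer, process, target)
    # The real work would happen here; this sketch only reports it.
    return f"{process} ran against {target}"
```

Under this scheme, an approval granted for a test target does nothing for a production target, and running the same process twice takes two approvals.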

Yes, Amazon has fixed that—now everything goes through change management. Back to the timeline:

At 12:24 PM on December 24, ELB state data was deleted. “The ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers,” according to Amazon. But the system could still handle requests to create and manage new load balancers, because it didn’t need the deleted state data to do that.

Amazon’s technical teams spotted the API errors but didn’t spot the pattern that new load balancers could be managed while older (pre-12:24 PM) load balancers couldn’t be properly managed, because their configuration data was gone.
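The pattern the teams missed is the kind of thing that shows up when errors are split by when each load balancer was created. A minimal Python sketch of that split follows. The function name, the record layout and the sample records are all made up for illustration; nothing here comes from Amazon's logs or tools.

```python
from datetime import datetime

# 12:24 PM Pacific on December 24, the deletion time given in the timeline.
DELETION_TIME = datetime(2012, 12, 24, 12, 24)


def error_rate_by_cohort(calls, cutoff=DELETION_TIME):
    """Split management API calls by load balancer creation time.

    `calls` is an iterable of (created_at, succeeded) pairs, one per call.
    Returns the error rate for load balancers created before the cutoff
    ("older") and at or after it ("newer").
    """
    totals = {"older": [0, 0], "newer": [0, 0]}  # [errors, calls]
    for created_at, succeeded in calls:
        cohort = "older" if created_at < cutoff else "newer"
        totals[cohort][1] += 1
        if not succeeded:
            totals[cohort][0] += 1
    return {c: (e / n if n else 0.0) for c, (e, n) in totals.items()}


# Made-up sample records, purely illustrative.
SAMPLE_CALLS = [
    (datetime(2012, 12, 20, 9, 0), False),
    (datetime(2012, 12, 22, 14, 30), False),
    (datetime(2012, 12, 24, 11, 0), False),
    (datetime(2012, 12, 24, 12, 0), True),
    (datetime(2012, 12, 24, 13, 15), True),
    (datetime(2012, 12, 24, 14, 45), True),
]

rates = error_rate_by_cohort(SAMPLE_CALLS)
```

With these invented records, the older cohort fails three calls out of four and the newer cohort fails none, which is the sort of lopsided split that would point at missing configuration data rather than a general API problem.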

Meanwhile, some customers began to see performance problems with their cloud applications. It wasn’t until the team started digging into the specifics of those performance problems that they spotted the missing state data as the root cause of the problem.

At 5:02 PM on December 24, the Amazon team stopped the spread of the problem and began looking for a way to fix it.
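The timestamps reported so far can be checked with a little date arithmetic. The Python sketch below assumes the 23-hour, 41-minute total is counted from the 12:24 PM deletion; that starting point is an assumption, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Times are Pacific, as reported in the timeline.
deletion = datetime(2012, 12, 24, 12, 24)     # state data deleted
containment = datetime(2012, 12, 24, 17, 2)   # spread of the problem stopped
total_outage = timedelta(hours=23, minutes=41)

# How long the problem ran before the team contained it.
time_to_containment = containment - deletion

# Assumption: the outage clock starts at the deletion.
estimated_end = deletion + total_outage

# Time left for re-creating the data and bringing the system back.
recovery_time = total_outage - time_to_containment

print(time_to_containment)  # 4:38:00
print(estimated_end)        # 2012-12-25 12:05:00
print(recovery_time)        # 19:03:00
```

On that assumption, the team needed 4 hours and 38 minutes to contain the problem, and the remaining 19 hours and 3 minutes is consistent with more than 12 hours to re-create the data plus several more to restore service.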

