Amazon’s Details Expose Cloud’s Ugly Side

Written by Frank Hayes
May 4th, 2011

In a detailed postmortem of its days-long cloud-storage outage, Amazon on April 29 delivered a blow-by-blow explanation of what went wrong: One networking mistake generated a cloud full of “stuck” storage, which in turn filled up all available space with junk data in an attempt to recover automatically. In the end, Amazon had to bring in lots of new storage hardware to unjam the system.

The cascading problems were the result of Amazon’s efforts to promise continuous availability of its cloud storage. That meant no downtime for maintenance windows—Amazon’s network techs had to work without a net, and this time they were unlucky. But a dive into the details of the outage suggests that a cloud like Amazon’s may not be worth the risk, or even offer an advantage, for big retailers—even though Amazon itself is one of the biggest.

The question comes down to whether retailers need the constant availability that Amazon’s cloud storage offers. There’s a time when brick-and-mortar stores are closed and E-Commerce traffic slows to a trickle. Every retail IT shop knows when that is, and that’s the time for a maintenance window—shutting down all or part of a datacenter to make significant changes. That window provides a safety buffer, when changes can be tested and a single error can’t turn into a runaway problem.

Amazon’s cloud operation doesn’t have that option; it has to promise customers that the cloud will be available all the time. That obligation to provide 24/7 uptime, which retailers don’t need, could be the greatest source of risk for any retailer considering the cloud.

According to Amazon, its cloud outage began with exactly the sort of change that a conventional datacenter would use a maintenance window for. At 12:47 AM Los Angeles time on April 21, as part of a procedure to upgrade the network for one of Amazon’s “availability zones,” network techs shifted traffic off one high-capacity router to clear the path for upgrading it.

Traffic was supposed to be shifted to another high-capacity router. Instead, it was mistakenly redirected to a low-capacity network that couldn’t handle the load of storage nodes that are constantly making copies of themselves—a fundamental process in the way Amazon’s cloud storage works.

With the sudden loss of a usable network, many storage nodes lost contact with their replicas.


One Comment

  1. Steve Sommers Says:

    I have to disagree with the simplistic takeaway that 24/7 systems are inherently riskier than 23/7 systems. “23/7” is just a name I gave for anything less than 24/7 that requires a maintenance window. My experience is that the overall level of risk is virtually the same; it just takes a different form.

    I have a lot of experience with our 24/7 systems, and with interfaces to many third-party applications and hosts of both flavors. I find that when issues arise with 24/7 systems, they tend to be smaller in nature and get diagnosed and resolved more quickly. With 23/7 systems, on the other hand, issues may arise less often, but when they do, they have a much higher chance of being catastrophic and take much longer to resolve. There are always exceptions on both sides, but this has been my experience.



