Amazon’s Details Expose Cloud’s Ugly Side

May 4th, 2011

Techs weren’t able to start adding new storage until 2:00 AM on April 22—more than a day after the start of the outage. By 12:30 PM, all but 2.2 percent of the volumes were restored, although not all of them were completely unstuck. It took until 11:30 AM on April 23 to work out how to reconnect the stuck volumes to the control plane without overloading it again and to test the process. By 6:15 PM, most nodes were communicating again.

Then came the process of manually trying to fix the remaining 2.2 percent of the volumes that were still stuck. By 12:30 PM on April 24, three and a half days after the outage began, all but 1.04 percent of the affected volumes were recovered. In the end, 0.07 percent of the volumes could never be restored. (Amazon sent snapshots of that data to the customers it belonged to, advising them, “If you have no need for this snapshot, please delete it to avoid incurring storage charges.”)

And Amazon’s cloud database service? That was affected, too, and the results were even more catastrophic. The cloud database service runs on the cloud storage system, and a single database can span several storage volumes, so one stuck volume is enough to stall it. At worst, only 13 percent of the storage volumes in the crippled availability zone were stuck. But at the peak of the problem, 45 percent of the databases located entirely in that zone were crippled by stuck volumes.
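A rough way to see how 13 percent of volumes could translate into 45 percent of databases is sketched below. It rests on assumptions that are not in Amazon's figures as quoted here: that each database depends on a fixed number of volumes, that a database is crippled if any one of them is stuck, and that volumes get stuck independently of one another. The function name and the volume counts are illustrative only.

```python
def share_of_databases_hit(stuck_volume_share: float, volumes_per_database: int) -> float:
    """Chance that at least one of a database's volumes is stuck.

    Assumes (hypothetically) that volumes get stuck independently
    and that one stuck volume is enough to cripple the database.
    """
    return 1.0 - (1.0 - stuck_volume_share) ** volumes_per_database


# 13 percent is the peak share of stuck volumes cited in the article.
for volumes in range(1, 7):
    share = share_of_databases_hit(0.13, volumes)
    print(f"{volumes} volume(s) per database: {share:.1%} of databases hit")
```

Under these assumptions, the 45 percent figure falls between four and five volumes per database. Real failures were almost certainly correlated, so this is only an illustration of the compounding effect, not a reconstruction of what happened.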

The final tally for the outage: Exactly half a week during which a significant number of Amazon cloud customers suffered from crippled or nonexistent IT functionality.

In a conventional datacenter, with a conventional approach to maintenance windows, that would have been almost impossible (although American Eagle Outfitters might beg to differ). The initial network configuration error would probably have been caught as soon as testing of the changes began. The cascade of stuck storage nodes, the control plane thread starvation, the exhausted storage space and the crippled databases—they never would have happened.

But all that technology dedicated to supporting Amazon’s high priority for availability ultimately produced roughly 1 percent of a year (three and a half days out of 365) as downtime in a single stretch.
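The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch that takes the article's three-and-a-half-day outage at face value and assumes a 365-day year; the function name is made up for the example.

```python
DAYS_PER_YEAR = 365  # assumption: a non-leap year


def downtime_share_of_year(outage_days: float) -> float:
    """Fraction of a year taken up by an outage of the given length."""
    return outage_days / DAYS_PER_YEAR


outage_days = 3.5  # "half a week," per the article's tally
share = downtime_share_of_year(outage_days)
print(f"Downtime: {share:.2%} of the year")
print(f"Best possible availability for that year: {1 - share:.2%}")
```

That works out to a little under 1 percent, which means a customer stuck for the whole outage could see no better than about 99 percent availability for the year, even if nothing else went wrong.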

Amazon has outlined changes it plans to make, and that should make the next incident less painful. But when an IT shop has to work without a net, there will be a next incident. There’s no way to avoid it.



One Comment | Read Amazon’s Details Expose Cloud’s Ugly Side

  1. Steve Sommers Says:

    I have to disagree with the simplistic takeaway that 24/7 systems are inherently riskier than 23/7 systems. “23/7” is just a name I gave for anything less than 24/7 that requires a maintenance window. My experience is that the risk is virtually the same, just different.

    I have a lot of experience with our 24/7 systems, and dealing with interfaces to many third-party applications and hosts of both flavors. I find that when issues arise with 24/7 systems, they tend to be smaller in nature and get diagnosed and resolved quicker. 23/7 systems on the other hand, while issues may be less often, when they occur, they have a much higher chance of being catastrophic in nature and take much longer to get resolved. There are always exceptions on both sides but this has been my experience.
