
This is page 2 of:

Amazon’s Details Expose Cloud’s Ugly Side

May 4th, 2011

With the sudden loss of a usable network, many storage nodes lost contact with their replicas. Amazon’s system is set up so that when that happens, the storage node assumes the replica has gone bad and immediately begins searching for a place to create a new replica. Normally, that would happen in milliseconds. But it wasn’t until techs identified and corrected the network mistake that those storage nodes could try to mirror themselves.

When the network was restored, it was a catastrophe. A large number of nodes simultaneously went looking for places to replicate. The available free storage was quickly exhausted, leaving many nodes stuck in a loop, searching for free space—what Amazon called a “re-mirroring storm” that prevented 13 percent of the storage volumes in the affected availability zone from doing anything other than looking for space that wasn’t there.
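To make the mechanics concrete, here is a toy Python sketch of a re-mirroring storm. It is not Amazon's code: the function name `remirror`, the round-based retry model and the node counts are all assumptions chosen for illustration, with the free space set so that 13 percent of the nodes end up stuck.

```python
def remirror(nodes_needing_replica: int, free_slots: int, max_rounds: int = 5):
    """Toy model of a re-mirroring storm (hypothetical, for illustration only).

    Every round, each node that still lacks a replica asks for a free slot.
    Returns (nodes still stuck, total requests issued).
    """
    stuck = nodes_needing_replica
    requests = 0
    for _ in range(max_rounds):
        if stuck == 0:
            break
        requests += stuck                 # every stuck node asks again
        granted = min(stuck, free_slots)  # only as many as there is space for
        free_slots -= granted
        stuck -= granted
    return stuck, requests


if __name__ == "__main__":
    # Assumed numbers: 100 nodes lose their replicas, but only 87 slots are free.
    stuck, requests = remirror(100, 87)
    print(f"{stuck} nodes still stuck after {requests} requests")
```

With enough free space for everyone (`remirror(100, 100)`), every node finds a home in a single round. Take away the spare capacity and the leftover nodes keep asking every round without ever succeeding, which is the loop described above.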

All those requests for more space were hammering a software control plane that did the work of creating new storage volumes. Because the control plane was configured with a long time-out period, requests for space began to back up. The backlog used up all of the control plane’s processor threads, which locked it up. Result: The problems spread from a single availability zone to other availability zones in the Virginia datacenter.
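The arithmetic behind that lock-up can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a description of Amazon's control plane: the pool size, request rate and time-out values below are invented for illustration, and the only idea used is that the average number of busy threads is roughly the arrival rate multiplied by how long each request holds a thread.

```python
# All values below are hypothetical; the article gives no actual figures.
POOL_SIZE = 200          # assumed number of control-plane threads
REQUESTS_PER_SEC = 50    # assumed rate of "give me space" requests
HEALTHY_SECONDS = 0.05   # assumed time to serve a request when space exists
TIMEOUT_SECONDS = 30     # assumed long time-out when no space can be found


def threads_in_use(requests_per_sec: float, hold_seconds: float) -> float:
    """Average threads busy at once: arrival rate times time each request holds a thread."""
    return requests_per_sec * hold_seconds


def is_saturated(requests_per_sec: float, hold_seconds: float, pool_size: int) -> bool:
    """True when the demand for threads meets or exceeds the pool."""
    return threads_in_use(requests_per_sec, hold_seconds) >= pool_size


if __name__ == "__main__":
    for label, hold in (("healthy", HEALTHY_SECONDS), ("timing out", TIMEOUT_SECONDS)):
        busy = threads_in_use(REQUESTS_PER_SEC, hold)
        print(f"{label}: ~{busy:g} threads wanted, "
              f"saturated={is_saturated(REQUESTS_PER_SEC, hold, POOL_SIZE)}")
```

Under these assumed numbers the same request rate that barely touches the pool when requests finish quickly demands far more threads than exist once every request sits until it times out, leaving nothing for requests from other zones.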

At 2:40 AM Los Angeles time—two hours after the original network mistake—techs disabled the capability of nodes in the original availability zone to ask for new space. By 2:50 AM, the control plane began to stabilize.

But by 5:30 AM, as the number of stuck storage nodes increased, the control plane began to fail again—and this time, it was knocked out entirely. At 8:20 AM, techs began disabling all communication between storage nodes in the original availability zone and the control plane. Once again, everything outside that zone began returning to normal.

By 11:30 AM, techs figured out a way to stop the servers in the problem zone from asking each other for storage space that none of them had. By 12:00 PM, error rates had returned to near normal, but the number of stuck volumes was back up to 13 percent.

And the only way to get them unstuck was to physically bring in lots more storage. There was no way to kill off the many stuck data replicas until working replicas were created, nor was there space to create the working replicas without new hardware. Amazon couldn’t even use its own cloud services for that storage—its “regions” are kept isolated from each other to keep problems from spreading.

Techs weren’t able to start adding new storage until 2:00 AM on April 22—more than a day after the start of the outage.



One Comment | Read Amazon’s Details Expose Cloud’s Ugly Side

  1. Steve Sommers Says:

    I have to disagree with the simplistic takeaway that 24/7 systems are inherently riskier than 23/7 systems. “23/7” is just a name I gave for anything less than 24/7 that requires a maintenance window. My experience is that the risk is virtually the same, just different.

    I have a lot of experience with our 24/7 systems, and with interfaces to many third-party applications and hosts of both flavors. I find that when issues arise with 24/7 systems, they tend to be smaller in nature and get diagnosed and resolved quicker. With 23/7 systems, on the other hand, issues may arise less often, but when they do occur, they have a much higher chance of being catastrophic and take much longer to resolve. There are always exceptions on both sides, but this has been my experience.
