

Wal-Mart’s Painful Common-Point-Of-Failure Lesson

September 30th, 2010

During that maintenance, Wal-Mart “had a breaker fail on the back-up system”? Was that a data backup? That wouldn’t disrupt a system unless data was being restored when the power went out. And why would Wal-Mart IT be restoring data to a critical system at 9 A.M.?

More likely is that the “back-up system” in Wal-Mart’s fuzzily worded statement was a back-up power system—an uninterruptible power supply (UPS) that should have been able to fail without any impact at all, because it’s just there in case the main power supply fails. How could that have caused a major failure?

Unfortunately, far too easily. Think of all those racks of equipment in your datacenter—servers and data arrays, each equipped with two power supplies, so if one fails the other will keep going. Those power supplies are supposed to be plugged into two different UPSs. That way, if one power source goes south, the other will keep things running.

But if both power supplies from a piece of equipment are plugged into the same UPS, then a single UPS outage—caused by a circuit breaker that fails—will take it down hard. Yes, that happens. Racks make it easy to swap equipment in and out. But when that’s done too quickly, it’s easy for IT operations staff to lose track of where the juice for each power supply is coming from. That creates a single point of failure—one that can go for months or even years without being spotted.
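
If a mis-plugged power cord really was the culprit, that is exactly the kind of exposure a routine audit can catch before a breaker finds it for you. Below is a minimal sketch in Python, purely illustrative and not anything Wal-Mart runs: it checks a hypothetical rack inventory and flags any device whose two "redundant" power supplies are actually fed from the same UPS. The inventory format, the device names and the UPS labels are all made-up assumptions.

```python
# Hedged sketch: audit a hypothetical rack inventory and flag any device
# whose "redundant" power supplies are both fed from the same UPS, i.e.,
# the hidden single point of failure described above. All names are invented.

from collections import defaultdict

# Hypothetical inventory: device name -> (feed for PSU A, feed for PSU B)
POWER_FEEDS = {
    "payments-db-01":  ("UPS-1", "UPS-2"),   # correctly split across UPSs
    "payments-app-01": ("UPS-1", "UPS-1"),   # both cords on one UPS: exposed
    "switch-core-01":  ("UPS-2", "UPS-2"),   # also exposed
}

def find_single_points_of_failure(feeds):
    """Return devices that lose all power if one UPS (or its breaker) fails."""
    exposed = []
    for device, (feed_a, feed_b) in feeds.items():
        if feed_a == feed_b:
            exposed.append((device, feed_a))
    return exposed

def devices_per_ups(feeds):
    """Show how many devices each UPS carries, another clue to imbalance."""
    load = defaultdict(set)
    for device, cords in feeds.items():
        for feed in cords:
            load[feed].add(device)
    return load

if __name__ == "__main__":
    for device, ups in find_single_points_of_failure(POWER_FEEDS):
        print(f"WARNING: {device} has both power supplies on {ups}")
    for ups, devices in devices_per_ups(POWER_FEEDS).items():
        print(f"{ups} feeds {len(devices)} device(s)")
```

The point is not the tooling; it is that "where does each cord actually plug in?" is a question that can be answered on a schedule instead of rediscovered during an outage.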

Still, that’s all speculation. The simple reality for Wal-Mart is that each store’s capability to handle payment cards depended on a connection from every card-swiping device to the single card-processing system.

As a result, one breaker gone bad was able to cripple thousands of stores. That’s a catastrophic IT operations failure. It shouldn’t have happened. What’s worse, Wal-Mart clearly couldn’t see that it ever could happen.

Yes, it can happen. That’s why you have redundant power supplies and do data backups. It’s why you set up plans to deal with the unthinkable and design systems so that when things do go wrong—even things that shouldn’t ever go wrong—there’s a fallback plan, not just in IT but in the stores, too.
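
To make that concrete, here is one hypothetical shape such a fallback could take at the register, sketched in Python. It is an assumption-laden illustration, not a description of Wal-Mart's actual payment architecture: the endpoint names, the authorize() placeholder and the offline floor limit are all invented for the example. The idea is simply that when the primary card-processing host is unreachable, the register tries a secondary, and when both are down it approves small tickets offline for later settlement instead of turning shoppers away.

```python
# Hedged sketch of a register-side fallback plan. Everything here is an
# assumption for illustration; it does not reflect Wal-Mart's real systems.

import queue

PRIMARY = "https://payments-primary.example.internal/authorize"
SECONDARY = "https://payments-secondary.example.internal/authorize"
OFFLINE_FLOOR_LIMIT = 50.00          # small transactions approved offline
store_and_forward = queue.Queue()    # held for later settlement

def authorize(endpoint, card_token, amount):
    """Placeholder for a real call to a card-processing host."""
    raise ConnectionError(f"{endpoint} unreachable")  # simulate the outage

def process_payment(card_token, amount):
    # 1. Try the primary card-processing host, then the secondary.
    for endpoint in (PRIMARY, SECONDARY):
        try:
            return authorize(endpoint, card_token, amount)
        except ConnectionError:
            continue  # fall through to the next host
    # 2. Both hosts down: approve small tickets offline and queue them for
    #    later settlement, so the registers keep moving.
    if amount <= OFFLINE_FLOOR_LIMIT:
        store_and_forward.put((card_token, amount))
        return "approved-offline"
    return "declined-ask-for-other-tender"

if __name__ == "__main__":
    print(process_payment("tok_1234", 23.75))   # approved-offline
    print(process_payment("tok_5678", 480.00))  # declined-ask-for-other-tender
```

The branch that matters is the last one: even when everything upstream is broken, the store has a pre-agreed answer for the customer standing at the register.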

Wal-Mart didn’t just have a failure in IT operations or a problem with one bad breaker. It had a fundamental weakness in its IT systems—and a blind spot that prevented anyone from seeing it in time.



8 Comments

  1. Bryan Larkin Says:

    And the question is, how many more of these types of unplanned-for situations exist – at Wal-Mart and elsewhere – just waiting to cause their own “15 minutes of fame”, or 5 hours as in Wal-Mart’s case?

  2. Alan McRae Says:

    Great story! I work on small MDFs and IDFs in all sorts of businesses, and the haphazard spaghetti of unlabeled wiring is a disaster waiting to happen. It’s amazing how poorly documented most networks really are, and the non-IT stuff that gets stored in the “equipment room”: extra office chairs, assorted dinnerware and styrofoam cups, miscellaneous furniture, plant pots, bags of road salt, and much more. While servers are backed up reasonably regularly, network infrastructure disaster recovery is often barely on the radar screen. Common points of failure are everywhere in this scenario, so once an outage starts it’s headless chickens running around trying to restore operations in an ad hoc fashion. A shame, really, because with a handheld labeler and some graph paper it is so easy to clearly document a small network and then look for disaster recovery shortcomings. Nice to know that the Big Guys miss this kind of stuff too!

  3. Mark Gibbs Says:

    Frank,

    “Still, that’s all speculation.” … indeed it is, as are the theories that the outage was caused by aliens or that it was done by terrorists. I really don’t see the point in speculating about what caused the outage when Walmart, not surprisingly and certainly not unreasonably, doesn’t want to enlighten us. For Walmart IT this has to be a major embarrassment, and I suspect that heads will roll.

    A glaring omission in many organizations is failing to run risk studies to identify these potential problem situations. Of course, even when you’ve done that kind of groundwork Murphy’s Law pretty much guarantees that you will have missed something. Like the financial company I knew that had a data center on the 8th floor. They had every eventuality covered and they had done a very thorough end-to-end risk assessment … but they left out one possibility probably because they were so high up: Flood. Yep, the water tank on the roof leaked and they found themselves wading through the computer room.

  4. Denise Bedell Says:

    But the question is, is this just one of those things that cannot be planned for? Or can companies put in place policies to ensure that such a failure does not happen? And who should be responsible for that policy and policing it?

    Denise Bedell
    editor/blogger
    CFOZone.com

  5. Evan Schuman Says:

    Interesting questions, Denise, but I think they have clear answers. Going in reverse:
    “Who should be responsible for that policy and policing it?” IT and, ultimately, the CIO.
    “Is this just one of those things that cannot be planned for? Or can companies put in place policies to ensure that such a failure does not happen?” We’re going to need Wal-Mart to complete–or at least get further along–its internal probe so that we can hopefully better understand exactly what happened. But let’s take a look at what we do know. Wal-Mart’s official statement, which you just KNOW had to go through Legal and quite a few others, said: “We had a breaker fail on the back-up system, which disrupted our ability to process some credit and debit card transactions.” No matter how you interpret that, it meant that there existed one centralized system handling every Wal-Mart payment process in the country (in-store only; online and mobile were spared). It was a combination of happenings, but there was clearly a centralized flaw. It’s not as simple as having plugged too many critical devices into a large UPS, but no system of this size and importance should be able to be taken down by any one action.
    It was helped along by human intervention, where someone was trying to help and apparently–and inadvertently–did the wrong thing. What they did and exactly what they were reacting to will answer many of your questions, but I think it’s fair to conclude that there was too much single-point-of-failure exposure here. I guess the fact that Online was handled separately is the only hint that there was at least some single-point-of-failure protection here. (Although that system being separate seemed to be more of a convenience, as opposed to a plan to protect at least some transactions in the case of this kind of problem.)

  6. DC guy Says:

    I am sure they have heard of separate A/B power paths…. Wow, can’t believe that WM would not have that (assuming that is what actually happened), understanding what is at stake if a failure occurs.

  7. Ed Says:

    Well, after reading, you will see that there was not a single point of failure on it. There were obviously two systems that handle these transactions. The primary was brought down for maintenance; oh, let’s say it ran out of disk space or something. At that point they were operating on a backup system (secondary). At that point you are very vulnerable and the pucker factor is high for everyone. They have probably done this many times in the past with no issues, and this time it bit ’em when the breaker blew. But why a system this large does not have A/B power I don’t know. Or they do, and someone hooked that server up with both power supplies to the A channel by accident instead of splitting them across both.

  8. Denise Bedell Says:

    Have there been any updates on this story? Has Wal-Mart said what happened and what they have done to remedy it and improve Business Continuity Management around this?

    Denise Bedell
    editor/blogger
    CFOZone.com
