Wal-Mart’s Painful Common-Point-Of-Failure Lesson
Even though the POS units in all of the chain's more than 4,300 U.S. stores could no longer handle payment cards, Walmart.com was still able to process credit and debit transactions. That's because online payment processing runs on a separate system.
Had Wal-Mart realized that its dot-com arm was the only part of the company still capable of processing payment card transactions, it could have pointed that out to the frustrated customers who visited its stores that morning.
The results in the brick-and-mortar stores were chaotic. Effects of the outage stretched from coast to coast. Within minutes after the outage started at about 9 A.M. Chicago time, some Wal-Mart stores closed to avoid dealing with long lines of customers who couldn’t pay for their items. Other stores remained open but had greeters inform customers of the sudden cash-or-checks-only situation, which turned many customers away. Still other stores didn’t inform customers until they arrived at the checkouts.
But which way each store went was up to the individual store’s manager. No chain-wide policy existed for this situation. Wal-Mart simply didn’t believe it could happen.
By 11 A.M. Chicago time, many stores were able to handle payment cards. But some stores reported they were still having problems more than five hours after the outage began.
Wal-Mart wouldn’t estimate exactly how much the outage cost. And how could it? Some customers were turned away. Others went elsewhere as word of the outage spread. Shopping carts were abandoned at checkouts when customers couldn’t pay, which left fresh groceries that had to be thrown away because the food couldn’t be restocked. The final tab for the failure may not be known for months.
That was the result outside the datacenter. What happened inside is a lot less clear.
Wal-Mart’s statement that “While doing some required maintenance on our data system, we had a breaker fail on the back-up system” raises more questions than it answers.
Wal-Mart’s IT people were doing “required maintenance”? It certainly wasn’t the kind of maintenance you’d schedule in advance. No one schedules maintenance on a critical system at 9 A.M. on a Thursday. That’s what the graveyard shift is for.
September 30th, 2010 at 7:42 am
And the question is: how many more of these kinds of unplanned-for situations exist, at Wal-Mart and elsewhere, just waiting to cause their own “15 minutes of fame” (or five hours, as in Wal-Mart’s case)?
September 30th, 2010 at 8:17 am
Great story! I work on small MDFs and IDFs in all sorts of businesses, and the haphazard spaghetti of unlabeled wiring is a disaster waiting to happen. It’s amazing how poorly documented most networks really are, and how much non-IT stuff gets stored in the “equipment room”: extra office chairs, assorted dinnerware and styrofoam cups, miscellaneous furniture, plant pots, bags of road salt, and much more. While servers are backed up reasonably regularly, network infrastructure disaster recovery is often barely on the radar screen. Common points of failure are everywhere in this scenario, so once an outage gets started, it’s headless chickens running around trying to restore operations in an ad hoc fashion. A shame, really, because with a handheld labeler and some graph paper it is so easy to clearly document a small network and then look for disaster recovery shortcomings. Nice to know that the Big Guys miss this kind of stuff too!
September 30th, 2010 at 12:44 pm
Frank,
“Still, that’s all speculation.” … indeed it is, as are the theories that the outage was caused by aliens or carried out by terrorists. I really don’t see the point in speculating about what caused the outage when Walmart, not surprisingly and certainly not unreasonably, doesn’t want to enlighten us. For Walmart IT this has to be a major embarrassment, and I suspect that heads will roll.
A glaring omission in many organizations is failing to run risk studies to identify these potential problem situations. Of course, even when you’ve done that kind of groundwork, Murphy’s Law pretty much guarantees that you will have missed something. Like the financial company I knew that had a data center on the 8th floor. They had done a very thorough end-to-end risk assessment and thought they had every eventuality covered … but they left out one possibility, probably because they were so high up: flood. Yep, the water tank on the roof leaked, and they found themselves wading through the computer room.
October 1st, 2010 at 7:16 am
But the question is, is this just one of those things that cannot be planned for? Or can companies put in place policies to ensure that such a failure does not happen? And who should be responsible for that policy and policing it?
Denise Bedell
editor/blogger
CFOZone.com
October 1st, 2010 at 9:01 am
Interesting questions, Denise, but I think they have clear answers. Going in reverse:
“Who should be responsible for that policy and policing it?” IT and, ultimately, the CIO.
“Is this just one of those things that cannot be planned for? Or can companies put in place policies to ensure that such a failure does not happen?” We’re going to need Wal-Mart to complete its internal probe, or at least get further along with it, so that we can hopefully better understand exactly what happened. But let’s take a look at what we do know. Wal-Mart’s official statement, which you just KNOW had to go through Legal and quite a few others, said: “We had a breaker fail on the back-up system, which disrupted our ability to process some credit and debit card transactions.” No matter how you interpret that, it means there was one centralized system handling every in-store Wal-Mart payment transaction in the country (online and mobile were spared). It was a combination of happenings, but there was clearly a centralized flaw. It’s not as simple as having plugged too many critical devices into one large UPS, but no system of this size and importance should be able to be taken down by any one action.
It was helped along by human intervention, where someone who was trying to help apparently, and inadvertently, did the wrong thing. What they did and exactly what they were reacting to will answer many of your questions, but I think it’s fair to conclude that there was too much single-point-of-failure exposure here. I guess the fact that online was handled separately is the only hint that there were at least some single-point-of-failure protections in place. (Although that system being separate seems to have been more of a convenience than a plan to protect at least some transactions in the case of this kind of problem.)
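To make that last point concrete, here is a minimal sketch, in Python, of the kind of failover routing I mean. The names (ProcessingPath, authorize_with_failover) and the two-path setup are hypothetical and are not Wal-Mart’s actual architecture; the point is simply that when authorizations can be routed across independent paths, losing one path degrades capacity instead of stopping every store:

# Hypothetical sketch, not Wal-Mart's actual setup: card authorizations
# are spread across independent processing paths so that one failed path
# cannot stop every store.
import random

class ProcessingPath:
    """One independent authorization path (its own power feed, gear, links)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def authorize(self, amount_cents):
        if not self.healthy:
            raise ConnectionError(self.name + " is down")
        return "APPROVED %d cents via %s" % (amount_cents, self.name)

def authorize_with_failover(paths, amount_cents):
    """Try each path in random order; fail only if every path is down."""
    for path in random.sample(paths, len(paths)):
        try:
            return path.authorize(amount_cents)
        except ConnectionError:
            continue  # this path is out; try the next one
    raise RuntimeError("All processing paths are down: cash or check only")

if __name__ == "__main__":
    paths = [ProcessingPath("path-A"), ProcessingPath("path-B")]
    paths[0].healthy = False               # simulate a breaker taking out one path
    print(authorize_with_failover(paths, 2499))   # still approves via path-B

Of course, in the real world those paths would also need separate power feeds, separate network gear and, ideally, separate facilities; otherwise the redundancy exists only on paper.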
October 11th, 2010 at 10:40 am
I am sure they have heard of separate A/B power paths… Wow, I can’t believe that WM would not have that (assuming that is what actually happened), understanding what is at stake if a failure occurs.
October 22nd, 2010 at 6:16 pm
Well, after reading this, you will see that there was not a single point of failure here. There were obviously two systems that handle these transactions. The primary was brought down for maintenance; oh, let’s say it ran out of disk space or something. At that point they were operating on a backup (secondary) system, and that’s when you are very vulnerable and the pucker factor is high for everyone. They have probably done this many times in the past with no issues, and this time it bit ’em when the breaker blew. But why a system this large does not have A/B power, I don’t know. Or they do, and someone accidentally hooked that server up with both power supplies on the A channel instead of one on each.
December 2nd, 2010 at 7:40 am
Have there been any updates on this story? Has Wal-Mart said what happened, what it has done to remedy it, and how it has improved its business continuity management as a result?
Denise Bedell
editor/blogger
CFOZone.com