Recovery Disaster: PayPal Crash Strands Merchants
Notice that along with PayPal’s two big technical glitches (the networking hardware meltdown and the failover that didn’t work), there was a third failure, this one non-technical: it took more than an hour for PayPal to announce the first outage to its users. Indeed, that outage had already been resolved by the time the company’s corporate communications department announced that PayPal was down. The second outage and its resolution weren’t announced until Friday evening.
That meant it was up to major E-tailers to contact PayPal on their own to find out exactly what was happening. Even for them, it took hours after the outages began to get the necessary information and cut off PayPal functionality.
It’s understandable that many E-Commerce players are still coming to grips with how crucial it is to keep everything running. Five- and 10-minute outages still aren’t unusual, and it’s tempting to assume that every outage will be fixed in just another minute.
But that’s a dangerous way of thinking. In PayPal’s case, it meant that big customers—who in this case were also big retailers—remained in the dark while IT people in PayPal’s datacenter assumed that the problem was about to be solved.
Like American Eagle, PayPal had a fallback plan, but it didn’t work the way it was supposed to. And like Wal-Mart, PayPal had no plan at all for quickly notifying the people most affected: store personnel in Wal-Mart’s case, the biggest E-Commerce partners in PayPal’s.
The lesson about failed backup plans just keeps getting bigger. Yes, improbable failures can happen. When they do, failover plans can fail. And when that happens, you need a plan already in place to warn those affected in real time.
November 4th, 2010 at 10:57 am
This whole incident inspires two thoughts. The first concerns backups in general. The simple answer is “practice, practice, practice.” Backup plans have to be exercised on a regular basis, and they must go full circle: transferring to the backup site and then bringing services back to the primary location.
But the other thing that comes to mind is “too big to fail,” to borrow a phrase from the financial crisis. A lot of retailers are considering Cloud Computing, and they should. Cloud Computing makes significant sense economically, but it also introduces a whole new set of risk factors. The backup plan becomes even more significant because the retailer is counting on its service provider to be practicing it. As processing becomes more centralized, the impact of a single outage becomes more significant. At the same time, the processes necessary to ensure adequate backup are becoming more opaque. Retailers considering Cloud solutions should weigh this in their evaluations.